
Declarative Logic Programming

ACM Books
Editor in Chief: M. Tamer Özsu, University of Waterloo
ACM Books is a new series of high-quality books for the computer science community, published by ACM in collaboration with Morgan & Claypool Publishers. ACM Books publications are widely distributed in both print and digital formats through booksellers and to libraries (and library consortia) and individual ACM members via the ACM Digital Library platform.

Declarative Logic Programming: Theory, Systems, and Applications Editors: Michael Kifer, Stony Brook University Yanhong Annie Liu, Stony Brook University 2018

The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition Editors: Sharon Oviatt, Monash University Björn Schuller, University of Augsburg and Imperial College London Philip R. Cohen, Monash University Daniel Sonntag, German Research Center for Artificial Intelligence (DFKI) Gerasimos Potamianos, University of Thessaly Antonio Krüger, German Research Center for Artificial Intelligence (DFKI) 2017

The Sparse Fourier Transform: Theory and Practice Haitham Hassanieh, University of Illinois at Urbana-Champaign 2018

The Continuing Arms Race: Code-Reuse Attacks and Defenses Editors: Per Larsen, Immunant, Inc. Ahmad-Reza Sadeghi, Technische Universität Darmstadt 2018

Frontiers of Multimedia Research Editor: Shih-Fu Chang, Columbia University 2018

Shared-Memory Parallelism Can Be Simple, Fast, and Scalable Julian Shun, University of California, Berkeley 2017

Computational Prediction of Protein Complexes from Protein Interaction Networks Sriganesh Srihari, The University of Queensland Institute for Molecular Bioscience Chern Han Yong, Duke-National University of Singapore Medical School Limsoon Wong, National University of Singapore 2017

The Handbook of Multimodal-Multisensor Interfaces, Volume 1: Foundations, User Modeling, and Common Modality Combinations Editors: Sharon Oviatt, Incaa Designs Björn Schuller, University of Passau and Imperial College London Philip R. Cohen, Voicebox Technologies Daniel Sonntag, German Research Center for Artificial Intelligence (DFKI) Gerasimos Potamianos, University of Thessaly Antonio Krüger, German Research Center for Artificial Intelligence (DFKI) 2017

Communities of Computing: Computer Science and Society in the ACM Thomas J. Misa, Editor, University of Minnesota 2017

Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining ChengXiang Zhai, University of Illinois at Urbana–Champaign Sean Massung, University of Illinois at Urbana–Champaign 2016

An Architecture for Fast and General Data Processing on Large Clusters Matei Zaharia, Stanford University 2016

Reactive Internet Programming: State Chart XML in Action Franck Barbier, University of Pau, France 2016

Verified Functional Programming in Agda Aaron Stump, The University of Iowa 2016

The VR Book: Human-Centered Design for Virtual Reality Jason Jerald, NextGen Interactions 2016

Ada’s Legacy: Cultures of Computing from the Victorian to the Digital Age Robin Hammerman, Stevens Institute of Technology Andrew L. Russell, Stevens Institute of Technology 2016

Edmund Berkeley and the Social Responsibility of Computer Professionals Bernadette Longo, New Jersey Institute of Technology 2015

Candidate Multilinear Maps Sanjam Garg, University of California, Berkeley 2015

Smarter Than Their Machines: Oral Histories of Pioneers in Interactive Computing John Cullinane, Northeastern University; Mossavar-Rahmani Center for Business and Government, John F. Kennedy School of Government, Harvard University 2015

A Framework for Scientific Discovery through Video Games Seth Cooper, University of Washington 2014

Trust Extension as a Mechanism for Secure Code Execution on Commodity Computers Bryan Jeffrey Parno, Microsoft Research 2014

Embracing Interference in Wireless Systems Shyamnath Gollakota, University of Washington 2014

Declarative Logic Programming Theory, Systems, and Applications

Michael Kifer Stony Brook University

Yanhong Annie Liu Stony Brook University

ACM Books #20

Copyright © 2018 by the Association for Computing Machinery and Morgan & Claypool Publishers. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations in printed reviews—without the prior permission of the publisher. Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan & Claypool is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

Declarative Logic Programming: Theory, Systems, and Applications
Michael Kifer, Yanhong Annie Liu, editors
books.acm.org
www.morganclaypoolpublishers.com

ISBN: 978-1-97000-199-0 hardcover
ISBN: 978-1-97000-196-9 paperback
ISBN: 978-1-97000-197-6 eBook
ISBN: 978-1-97000-198-3 ePub

Series ISSN: 2374-6769 print, 2374-6777 electronic

DOIs:
10.1145/3191315 Book
10.1145/3191315.3191316 Preface
10.1145/3191315.3191317 Chapter 1
10.1145/3191315.3191318 Chapter 2
10.1145/3191315.3191319 Chapter 3
10.1145/3191315.3191320 Chapter 4
10.1145/3191315.3191321 Chapter 5
10.1145/3191315.3191322 Chapter 6
10.1145/3191315.3191323 Chapter 7
10.1145/3191315.3191324 Chapter 8
10.1145/3191315.3191325 Chapter 9
10.1145/3191315.3191326 Chapter 10
10.1145/3191315.3191327 Index

A publication in the ACM Books series, #20
Editor in Chief: M. Tamer Özsu, University of Waterloo
Area Editor: Yanhong Annie Liu, Stony Brook University

This book was typeset in Arnhem Pro 10/14 and Flama using ZzTeX.

First Edition
10 9 8 7 6 5 4 3 2 1

To David Scott Warren, for his groundbreaking work on principles and systems for logic programming

Contents

Preface

PART I: THEORY

Chapter 1. Datalog: Concepts, History, and Outlook
David Maier, K. Tuncay Tekle, Michael Kifer, David S. Warren
1.1 Introduction
1.2 The Emergence of Datalog
1.3 Coining “Datalog”
1.4 Extensions to Datalog
1.5 Evaluation Techniques
1.6 Early Datalog and Deductive Database Systems
1.7 The Decline and Resurgence of Datalog
1.8 Current Systems and Comparison
1.9 Conclusions
Acknowledgments
References

Chapter 2. An Introduction to the Stable and Well-Founded Semantics of Logic Programs
Miroslaw Truszczynski
2.1 Introduction
2.2 Terminology, Notation, and Other Preliminaries
2.3 The Case of Horn Logic Programs
2.4 Moving Beyond Horn Programs—An Informal Introduction
2.5 The Stable Model Semantics
2.6 The Well-Founded Model Semantics
2.7 Concluding Remarks
Acknowledgments
References

Chapter 3. A Survey of Probabilistic Logic Programming
Fabrizio Riguzzi, Theresa Swift
3.1 Introduction
3.2 Languages with the Distribution Semantics
3.3 Defining the Distribution Semantics
3.4 Other Semantics for Probabilistic Logics
3.5 Probabilistic Logic Programs and Bayesian Networks
3.6 Inferencing in Probabilistic Logic Programs
3.7 Discussion
Acknowledgments
References

PART II: SYSTEMS

Chapter 4. WAM for Everyone: A Virtual Machine for Logic Programming
David S. Warren
4.1 Introduction
4.2 The Run-Time Environment of a Traditional Procedural Language
4.3 Deterministic Datalog
4.4 Deterministic Prolog
4.5 Nondeterministic Prolog
4.6 Last Call Optimization
4.7 Indexing
4.8 Environment Trimming
4.9 Features Required for Full Prolog
4.10 WAM Extensions for Tabling
4.11 Concluding Remarks
Acknowledgments
References

Chapter 5. Predicate Logic as a Modeling Language: The IDP System
Broes De Cat, Bart Bogaerts, Maurice Bruynooghe, Gerda Janssens, Marc Denecker
5.1 Introduction
5.2 FO(ID, AGG, PF, T), the Formal Base Language
5.3 IDP as a Knowledge Base System
5.4 The IDP Language
5.5 Advanced Features
5.6 Under the Hood
5.7 In Practice
5.8 Related Work
5.9 Conclusion
References

Chapter 6. SolverBlox: Algebraic Modeling in Datalog
Conrado Borraz-Sánchez, Diego Klabjan, Emir Pasalic, Molham Aref
6.1 Introduction
6.2 Datalog
6.3 LogicBlox and LogiQL
6.4 Mathematical Programming with LogiQL
6.5 The Traveling Salesman Problem (TSP) Test Case
6.6 Conclusions and Future Work
References

PART III: APPLICATIONS

Chapter 7. Exploring Life: Answer Set Programming in Bioinformatics
Alessandro Dal Palù, Agostino Dovier, Andrea Formisano, Enrico Pontelli
7.1 Introduction
7.2 Biology in a Nutshell
7.3 Answer Set Programming in a Nutshell
7.4 Phylogenetics
7.5 Haplotype Inference
7.6 RNA Secondary Structure Prediction
7.7 Protein Structure Prediction
7.8 Systems Biology
7.9 Other Logic Programming Approaches
7.10 Conclusions
Acknowledgments
References

Chapter 8. State-Space Search with Tabled Logic Programs
C. R. Ramakrishnan
8.1 Introduction
8.2 Finite-State Model Checking
8.3 Infinite-State Model Checking
8.4 Simple Planning via Tabled Search
8.5 Discussion
Acknowledgments
References

Chapter 9. Natural Language Processing with (Tabled and Constraint) Logic Programming
Henning Christiansen, Verónica Dahl
9.1 Introduction
9.2 Tabling, LP, and NLP
9.3 Tabled Logic Programming and Definite Clause Grammars
9.4 Using Extra Arguments for Linguistic Information
9.5 Assumption Grammars: DCGs Plus Global Memory
9.6 Constraint Handling Rules and Their Application to Language Processing
9.7 Hypothetical Reasoning with CHR and Prolog: Hyprolog
9.8 A Note on the Usefulness of Probabilistic Logic Programming for Language Processing
9.9 Conclusion
References

Chapter 10. Logic Programming Applications: What Are the Abstractions and Implementations?
Yanhong A. Liu
10.1 Introduction
10.2 Logic Language Abstractions
10.3 Join and Database-Style Queries
10.4 Recursion and Inductive Analysis
10.5 Constraint and Combinatorial Search
10.6 Further Extensions, Applications, and Discussion
10.7 Related Literature and Future Work
Acknowledgments
References

Index
Biographies

Preface

The idea of this book grew out of a symposium that was held at Stony Brook in September 2012 in celebration of David S. Warren’s fundamental contributions to Computer Science and the area of Logic Programming in particular.

Logic Programming (LP) is at the nexus of Knowledge Representation, Artificial Intelligence, Mathematical Logic, Databases, and Programming Languages. It is fascinating and intellectually stimulating due to the fundamental interplay among theory, systems, and applications brought about by logic. Logic programs are more declarative in the sense that they strive to be logical specifications of “what” to do rather than “how” to do it, and thus they are high-level and easier to understand and maintain. Yet, without being given an actual algorithm, LP systems implement the logical specifications automatically.

Several books cover the basics of LP but focus mostly on the Prolog language with its incomplete control strategy and non-logical features. At the same time, there is generally a lack of accessible yet comprehensive collections of articles covering the key aspects in declarative LP. These aspects include, among others, well-founded vs. stable model semantics for negation, constraints, object-oriented LP, updates, probabilistic LP, and evaluation methods, including top-down vs. bottom-up, and tabling.

For systems, the situation is even less satisfactory, lacking accessible literature that can help train the new crop of developers, practitioners, and researchers. There are a few guides on Warren’s Abstract Machine (WAM), which underlies most implementations of Prolog, but very little exists on what is needed for constructing a state-of-the-art declarative LP inference engine. Contrast this with the literature on, say, Compilers, where one can first study a book on the general principles and algorithms and then dive into the particulars of a specific compiler. Such resources greatly facilitate the ability to start making meaningful contributions quickly. There is also a dearth of articles about systems that support truly declarative languages, especially those that tie into first-order logic, mathematical programming, and constraint solving.


LP helps solve challenging problems in a wide range of application areas, but in-depth analyses of how these applications relate to LP language abstractions and LP implementation methods are lacking. Also, rare are surveys of challenging application areas of LP, such as Bioinformatics, Natural Language Processing, Verification, and Planning.

The goal of this book is to help fill this void in the LP literature. It offers a number of overviews on key aspects of LP that are suitable for researchers and practitioners as well as graduate students. The following chapters on the theory, systems, and applications of LP are included.

Part I: Theory

1. “Datalog: Concepts, History, and Outlook” by David Maier, K. Tuncay Tekle, Michael Kifer, and David S. Warren. This chapter is a comprehensive survey of the main concepts of Datalog, an LP language for data processing. The study of Datalog was one of the early drivers of more declarative approaches to LP. Some aspects of Datalog are covered in greater depth than others, but the bibliography at the end of the chapter can be relied upon to help fill in the details.

2. “An Introduction to the Stable and Well-Founded Semantics of Logic Programs” by Miroslaw Truszczynski. The theory underlying modern LP is based on two different logical semantics: the stable model semantics and the well-founded semantics, leading to two different declarative paradigms. The stable model semantics defines the meaning of an LP program as a set of two-valued models and is a leading approach in LP to solving combinatorial problems. The well-founded semantics always yields a single, possibly three-valued, model and is popular with knowledge representation systems that focus on querying. This chapter provides a rigorous yet accessible introduction to both semantics.

3. “A Survey of Probabilistic Logic Programming” by Fabrizio Riguzzi and Theresa Swift. Integration of probabilistic and logic reasoning is of great importance to modern Artificial Intelligence. This chapter provides a uniform introduction, based on what is called distribution semantics, to a number of leading approaches to such integration.

Part II: Systems

4. “WAM for Everyone: A Virtual Machine for Logic Programming” by David S. Warren.


This chapter is an introduction to Warren’s Abstract Machine (WAM), the primary technology underlying modern LP systems. Unlike previous expositions of the WAM, this chapter also describes tabling, one of the main breakthroughs that have occurred since D.H.D. Warren developed the original WAM in the 1970s.

5. “Predicate Logic as a Modeling Language: The IDP System” by Broes De Cat, Bart Bogaerts, Maurice Bruynooghe, Gerda Janssens, and Marc Denecker. IDP is a language and system that supports classical predicate logic extended with inductive definitions as rules, and with aggregation, functions, and types. It promotes a declarative semantics of logic programs and allows a single logical specification to be used in multiple different reasoning tasks to solve a range of problems. This chapter gives an overview of the IDP language and system.

6. “SolverBlox: Algebraic Modeling in Datalog” by Conrado Borraz-Sánchez, Diego Klabjan, Emir Pasalic, and Molham Aref. LogiQL extends Datalog with aggregation, reactive rules, and constraints to support transactions in a database and software development platform called LogicBlox. This chapter introduces LogiQL and shows how it can be used to specify and solve mixed-integer linear optimization problems by adding simple declarations.

Part III: Applications

7. “Exploring Life: Answer Set Programming in Bioinformatics” by Alessandro Dal Palù, Agostino Dovier, Andrea Formisano, and Enrico Pontelli. This chapter surveys applications of declarative LP in Bioinformatics, including in Genomics studies, Systems Biology, and Structural studies. The necessary biological background is provided along the way. The applications selected for this survey all rely on the Answer Set Programming paradigm.

8. “State-Space Search with Tabled Logic Programs” by C. R. Ramakrishnan. Model checking and planning are two important and challenging classes of problems that require extensive search through the possible states of systems. This chapter gives an overview of both types of problems as applications of LP. It focuses on tabling as an effective technique for improving the performance of state-space search.

9. “Natural Language Processing with (Tabled and Constraint) Logic Programming” by Henning Christiansen and Verónica Dahl.


Natural language processing (NLP) was the original motivation for the development of Prolog, the most popular logic-based language. This chapter is an overview of the applications of LP to NLP, primarily based on definite clause grammars.

10. “Logic Programming Applications: What Are the Abstractions and Implementations?” by Yanhong A. Liu. LP is used in many areas and contexts but the applicability of LP needs to be understood in more fundamental ways. This chapter gives an overview of LP applications, classifying them based on three key abstractions and their corresponding implementations. The abstractions are join, recursion, and constraint. The corresponding implementations are for-loops, fixed points, and backtracking.

Acknowledgments

Congratulations to the chapter authors for the well-thought-out surveys—all written specifically for this book. Each contribution was carefully reviewed, with at least four reviews per chapter. We are deeply grateful to the reviewers for the time and effort dedicated to perfecting the book material. Many thanks to the Editor-in-Chief of the ACM Books series, M. Tamer Özsu, for his support, and to Diane Cerra of Morgan & Claypool for her help, guidance, and patience throughout the process. We were lucky to have an expert production team: Paul Anagnostopoulos of Windfall Software, who masterfully typeset this book; Sara Kreisman of Rambling Rose Press, who copyedited the pesky typos out of existence; Brent Beckley of Morgan & Claypool, who designed the beautiful and artsy cover; and Christine Kiilerich, also of Morgan & Claypool, who helped with many publication tasks. We also acknowledge the support of NSF under grants CCF-0964196, CCF-1248184, CCF-1414078, and IIS-1447549; and of ONR under grant N000141512208.

PART I: THEORY

Chapter 1
Datalog: Concepts, History, and Outlook
David Maier, K. Tuncay Tekle, Michael Kifer, David S. Warren

This chapter is a survey of the history and the main concepts of Datalog. We begin with an introduction to the language and its use for database definition and querying. We then look back at the threads from logic languages, databases, artificial intelligence, and expert systems that led to the emergence of Datalog and reminisce about the origin of the name. We consider the interaction of recursion with other common data language features, such as negation and aggregation, and look at other extensions, such as constraints, updates, and object-oriented features. We provide an overview of the main approaches to Datalog evaluation and their variants, then recount some early implementations of Datalog and of similar deductive database systems. We speculate on the reasons for the decline of interest in the language in the 1990s and the causes for its later resurgence in a number of application areas. We conclude with several examples of current systems based on or supporting Datalog and briefly examine the performance of some of them.

1.1 Introduction

Datalog has had a tumultuous history during which interest in it waxed and waned and waxed again. The name was originally coined to designate a simplified Horn-logic language akin to Prolog, but has since come to identify research on deductive databases and recursive query processing. Given the more than three decades since the introduction of Datalog, the time is right to explore its roots, recount early work on it, try to understand its declining fortunes in the 1990s, and examine the recent resurgence of interest in the language. This chapter is not intended to be a tutorial on Datalog nor a comprehensive research survey; rather, it is a recounting of developments in the history of Datalog, with some personal interpretation here and there.

Before examining the roots of Datalog, we review the basics of the language. We begin with a sample database that we employ throughout, followed by some examples of a Datalog program and queries that use that database. The database contains information about doctoral theses and advisors.1 We use a schema with four relations to represent academic-descendant information:

person(ID, First, Last)
area(Area, AreaLongName)
thesis(PID, Univ, Title, Year, Area)
advised(AID, PID)

The person relation provides an identifier for each person involved, along with the person's name. The relation area provides areas and long descriptions of the different areas for theses. The thesis relation gives details on a person's thesis, while advised captures the information about the advisor(s) for a person. (One can have multiple advisors if they jointly advised that person or if the person obtained multiple Ph.D. degrees.)

A Datalog program typically consists of facts and rules. A fact asserts that a particular tuple belongs to a relation. From a logical viewpoint, it represents a predicate being true for a particular combination of values. Below are some Datalog facts corresponding to the thesis schema. An alphanumeric value starting with a lowercase letter is a unique symbol, not equal to any other value. If a symbol starts with a capital letter or if it contains non-alphanumeric characters (e.g., a space) then it is enclosed in single quotes. The year is represented here as an integer. From the data below we see that David Warren completed his Ph.D. thesis on Montague grammars at Michigan in 1979, under the direction of Joyce Friedman and William Rounds.

person(dsw, 'David', 'Warren').
person(jbf, 'Joyce', 'Friedman').
person(hw, 'Hao', 'Wang').
person(wvoq, 'Willard', 'Quine').
person(anw, 'Alfred', 'Whitehead').
person(wcr, 'William', 'Rounds').
person(dss, 'Dana', 'Scott').
person(ac, 'Alonzo', 'Church').
person(lvk, 'Laxmikant', 'Kale').
person(jmy, 'Joshua', 'Yelon').
person(fg, 'Filippo', 'Gioachin').
person(ks, 'Konstantinos', 'Sagonas').
person(ek, 'Erik', 'Stenman').
area(cs, 'Computer Science').
area(lg, 'Logic').
thesis(dsw, 'Michigan', 'Montague Grammars', 1979, cs).
thesis(jbf, 'Harvard', 'Decision Procedure', 1965, cs).
thesis(hw, 'Harvard', 'Economical Ontology', 1948, lg).
advised(jbf, dsw).
advised(wcr, dsw).
advised(hw, jbf).
advised(wvoq, hw).
advised(anw, wvoq).
advised(dss, wcr).
advised(ac, dss).
advised(dsw, lvk).
advised(lvk, jmy).
advised(lvk, fg).
advised(dsw, ks).
advised(ks, ek).

1. This example is inspired by the Mathematics Genealogy Project, http://genealogy.math.ndsu.nodak.edu/, which is also the source of most of the data used.

A Datalog rule is similar to a view definition in relational database systems. It states that if certain tuples exist in specific relations, then an additional tuple is assumed to exist in a "virtual" or derived relation. It can also be considered an inference rule for deducing new facts from existing ones. A rule has a head followed by a body, separated by the symbol :-, which is intended to resemble a left-pointing implication arrow. The head indicates the tuple being defined, and the body is a comma-separated list of facts, in which the commas are interpreted as "and". A rule generally contains one or more logical variables, which begin with uppercase letters.2 Note that, under these conventions, First would be a variable, while 'First' a constant symbol. The rule asserts that the implication holds for all possible substitutions of the logical variables.

2. Another popular convention in Datalog systems is to prefix variables with the question mark, e.g., ?First, ?Area, ?Year.

For example, suppose we want to define a relation with certain information about people who received a Ph.D. in the area of Computer Science. We could use the following Datalog rule:

csInfo(First,Last,Univ,Year) :-
    thesis(PID,Univ,Title,Year,Area),
    person(PID,First,Last),
    area(Area,'Computer Science').
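Written out as the implication that the :- arrow abbreviates, with every variable read as universally quantified, this rule says roughly the following (given here only as an informal gloss of the rule above):

for all First, Last, Univ, Year, PID, Title, Area:
    if thesis(PID, Univ, Title, Year, Area) and person(PID, First, Last) and area(Area, 'Computer Science')
    then csInfo(First, Last, Univ, Year).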

One way to understand a Datalog rule is as standing for all ground instances of the rule obtained by consistently substituting constants for the logical variables. Two such instances of the rule above are

csInfo('Joyce','Friedman','Harvard',1965) :-
    thesis(jbf,'Harvard','Decision Procedure',1965,cs),
    person(jbf,'Joyce','Friedman'),
    area(cs,'Computer Science').

csInfo('Joyce','Hansen','Harvard',1965) :-
    thesis(jbf,'Harvard','Decision Procedure',1965,cs),
    person(jbf,'Joyce','Hansen'),
    area(cs,'Computer Science').

The effect of a rule is to extend the database by the fact in the head for every such ground instance whenever all the facts in the body have been previously established—either because they are given explicitly as facts or because they were obtained from an instance of some rule using the same reasoning. Considering the two instances above, the first establishes the fact

csInfo('Joyce','Friedman','Harvard',1965)

as all the facts in the body are known to be true. However, the second rule instance does not establish the fact

csInfo('Joyce','Hansen','Harvard',1965)

as the fact person(jbf,'Joyce','Hansen') is not one listed in this example program (nor can it be established by the rules above). Another way to regard this rule is as a view definition, which we could write in SQL3 as

CREATE VIEW csInfo AS
  SELECT p.First, p.Last, t.Univ, t.Year
  FROM thesis t, person p, area a
  WHERE t.PID = p.ID AND t.Area = a.Area
    AND a.Long = 'Computer Science';

3. We assume in this chapter a basic knowledge of relational database systems, such as relational algebra and the SQL query language. The necessary background would be included in any introductory database text, such as Garcia-Molina et al. [2009].

Note that there is a difference between Datalog and SQL in referring to attributes: Datalog relies on positional notation, while SQL has explicit naming. Also, in SQL variables range over tuples while in Datalog they range over individual domain elements. For those familiar with relational database theory, the difference is essentially that between the domain and tuple relational calculi.

Datalog allows multiple rules with the same predicate in the head, in which case the database is extended by the facts established by any of the rules. For example, if we wanted to derive all people in an adjacent generation to a given person (that is, the person's direct advisors or advisees), we could use the two rules

adjacent(P1, P2) :- advised(P1, P2).
adjacent(P1, P2) :- advised(P2, P1).

We can view a Datalog program as describing a database, with some tuples given by facts and others established by rules. There are various conventions for expressing a query and the result against that database. For the moment, we will use a single predicate pattern, with zero or more variables, preceded by ?-, as a query. The result of the query will be the set of all matching facts in the database specified by the Datalog program. Thus, if our program is all the previous facts and rules, then the query ?- adjacent(dsw, P2).

has the result set

adjacent(dsw, lvk).
adjacent(dsw, ks).
adjacent(dsw, jbf).
adjacent(dsw, wcr).

where the first two tuples are established by the first adjacent rule, and the other two by the second. A query with no variables functions as a membership test, returning a singleton set if the corresponding tuple is in the database of the Datalog program, and an empty set if not. So the query ?- adjacent(dsw, jbf).


returns the singleton answer adjacent(dsw, jbf).

but the query ?- adjacent(dsw, hw).

returns an empty result.

While the two rules for adjacent can be captured by a view using UNION in SQL, Datalog can express virtual relations that SQL cannot, via recursion in rules.4 Suppose we want to consider two people “related” if they have a common academic ancestor. We can describe this relation with three rules:

related(P1, P2) :- advised(A, P1), advised(A, P2).
related(P1, P2) :- advised(B, P1), related(B, P2).
related(P1, P2) :- advised(C, P2), related(P1, C).

We see that two of the rules establish related-facts recursively, from other related-facts. So, for example, we can use the following instance of the first rule

related(lvk, ks) :- advised(dsw, lvk), advised(dsw, ks).

to establish the fact related(lvk, ks), then use that fact with the following instance of the second rule

related(jmy, ks) :- advised(lvk, jmy), related(lvk, ks).

to establish the fact related(jmy, ks). Note that while a newly established related-fact might be able to establish a further related-fact, the process must eventually stop. Every related-fact involves person IDs from the original advised-facts, and there are a finite number of pairs of such IDs.

More generally, we view the meaning of a Datalog program P consisting of facts and rules as the database of ground facts obtained starting with the original facts and applying the rules to derive additional facts until there are no changes. In this context it is useful to view P as a transformation (sometimes called the immediate consequence operator) that takes a set of ground facts as input and augments them with the additional facts that can be established with a single application of any fact or rule in P. If we start with an empty input, the first “application” of P adds all the facts in P. A second application adds all facts that can be established with one rule application, a third application adds facts established with two rule applications,

4. SQL:1999 introduced support for a comparatively limited form of linear recursion.


and so on. Eventually this process converges when some application of P adds no new facts. Thus, we can view the meaning of P as the least fixed point [Lloyd 1993, Van Emden and Kowalski 1976] under application of P's rules and facts. It turns out that, for Datalog, the meaning for P is also the least model for P, in the sense that it is the minimal set of facts that includes all the facts for P and makes every ground instance of a rule in P true. (We will see later that this equivalence must be adjusted in order to handle negation.)

We note that the process of repeated applications of P as described above is actually a viable procedure for computing the meaning of a Datalog program, and is referred to as bottom-up evaluation. A query is then matched against the final set of established facts to find its answers. An alternative approach is top-down evaluation, which starts with a particular query and tries to match it with a fact or rule head in P. If it matches a fact, then a new answer has been established. If it matches a rule head, the rule body becomes the new query, and the process continues, trying to match the goals in the rule body. For example, the query

?- adjacent(dsw, P2).

does not match any facts directly, but it does match the head of the rule adjacent(P1, P2) :- advised(P1, P2).

if we replace P1 by dsw. The body of the rule leaves the new query ?- advised(dsw, P2).

which matches the two facts advised(dsw, lvk). advised(dsw, ks).

resulting in two answers: adjacent(dsw, lvk).

Matching the original query with the other adjacent rule gives two additional answers. We will cover evaluation techniques in more detail in Section 1.5. Three final points before we leave this overview of Datalog. First, it is common to classify the tuples in the database specified by a Datalog program into two parts. The extensional database, or EDB, is all the tuples corresponding to explicit facts in the program, for example, advised(dsw, lvk). The intensional database, or IDB, is all the tuples established by the rules in the program, for example, related(lvk, ks). While the same relation could draw tuples from both the EDB and the IDB,


many authors assume each relation to be strictly extensional or intensional. This assumption is not really a limitation, since a Datalog program where it does not hold is easily rewritten into one where it does and that specifies the same database. The second is a restriction usually placed on Datalog programs that the rules be safe, in the sense that any fact in the IDB contains only the values that appear in the EDB or the rules themselves. For example, the rule badPred(P1, P2) :- advised(P1, P3).

is not safe, because the value of P2 is unconstrained and therefore P2 can be bound to any value, not necessarily the ones that appear in EDB. Generally, safety is guaranteed by a syntactic condition, such as requiring that every variable in the rule head also appear in some predicate in the rule body. The third point is a notational convenience. The first Datalog rule could be written more succinctly as: csInfo(F, L, U, Y) :thesis(P, U, _, Y, T), person(P, F, L), area(T, ’Computer Science’).

In this formulation, since the actual names of the variables do not matter, logical variables are consistently renamed with 1-letter variable names. Furthermore, inventing names for the variables that occur only once in a rule is an unnecessary burden and the underscore is often used to denote “anonymous” variables. Each occurrence of “_” is treated as a new variable name, which the parser generates automatically.
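To make the bottom-up evaluation process described earlier in this section concrete, the following Python fragment is a minimal illustrative sketch, not taken from any Datalog system; it is restricted to the two adjacent rules and the advised facts involving dsw, and a real engine would evaluate arbitrary rule bodies against the growing fact set rather than hard-coding the rules as done here.

ADVISED = {("jbf", "dsw"), ("wcr", "dsw"), ("dsw", "lvk"), ("dsw", "ks")}

def apply_once(db):
    # one application of the program: its facts plus every rule consequence
    out = set(db)
    for (x, y) in ADVISED:
        out.add(("advised", x, y))    # the explicit facts
        out.add(("adjacent", x, y))   # adjacent(P1, P2) :- advised(P1, P2).
        out.add(("adjacent", y, x))   # adjacent(P1, P2) :- advised(P2, P1).
    return out

db = set()
while True:                           # iterate to the least fixed point
    nxt = apply_once(db)
    if nxt == db:
        break
    db = nxt

# the query ?- adjacent(dsw, P2). is matched against the established facts
print(sorted(p2 for (rel, p1, p2) in db if rel == "adjacent" and p1 == "dsw"))

Under these sample facts the sketch prints ['jbf', 'ks', 'lvk', 'wcr'], the same four answers as the result set shown earlier for the query ?- adjacent(dsw, P2).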

1.2

The Emergence of Datalog Since Datalog is a subset of Prolog, one could create and execute Datalog programs as soon as Prolog evaluators existed—around 1972. However, as a named concept and object of study, Datalog emerged in the mid-1980s, following work in deductive databases in the late 1970s. (See Section 1.3 for more on the naming of Datalog.) Why did Datalog emerge as a topic of interest at that point? It was because Datalog served as a “sweet spot”—or “middle ground”—for related research lines from Logic Programming, Database Systems, and Artificial Intelligence. Why was Datalog interesting to the three communities? It was because pure Datalog was very simple and had a clean syntax and semantics. Yet it was expressive enough to serve as the basis for theoretical investigations and examination of


evaluation alternatives, as well as a foundation from which extensions could be explored and a starting point for knowledge-representation systems. We summarize relevant trends in each of the three communities below, following each by more specific background. Logic Programming Logic programmers saw relational databases as an implementation of an important sublanguage, and worked to integrate them into Prolog systems. Since early Prolog implementations assumed all rules and facts were memory-resident, it was clear that for very large fact bases, something like relational database technology was necessary. Sometimes this enhancement took the form of a connection to a relational database management system (RDBMS). Translators were written to convert a subset of Prolog programs into SQL to be passed to an RDBMS and evaluated there. Datalog was a more powerful, more Prolog-ish, data-oriented subset of Prolog than SQL. Datalog was also more “declarative” than Prolog and many in the Logic Programming community liked that aspect. With Prolog, there was a serious gap between Logic Programming theory (programs understood as logical axioms and deduction) and Logic Programming practice (programs as evaluated by Prolog interpreters). For one thing, Prolog contained a number of non-logical primitives; for another, many perfectly logical Prolog programs would not terminate due to the weaknesses of the evaluation strategy used by the Prolog interpreters. With Datalog, that gap almost disappeared—deduction largely coincided with evaluation in terms of the produced results. Also, Datalog provided a basis on which to work out solutions for recursion with negation, some of which could be applied to more general logic languages. Background. Logic programming grew out of resolution theorem proving pro-

posed by Robinson [1965] and, especially, out of a particularly simplified version of it, called SLD resolution [Lloyd 1993, Kowalski 1974], which worked for special cases of logic, such as Horn clauses.5 In the early 1970s, researchers began to realize that SLD resolution combined with backtracking provided a computational model. In particular, Colmerauer and collaborators developed Prolog [Colmerauer and Roussel 1996] and Kowalski made significant contributions to the theory of logic programming, as it came to be called [Kowalski 1988]. Prolog was the starting point for languages in the Japanese Fifth-Generation Computer Systems (FGCS) 5. A Horn clause is a logical implication among positive literals with at most one literal in the consequent.


project in the early 1980s [Fuchi and Furukawa 1987, Moto-oka and Stone 1984]. The FGCS project sought advances in hardware, databases, parallel computing, deduction and user interfaces to build high-performance knowledge-base systems. In the context of this project, Fuchi [1981] describes Prolog as a basis for bringing together programming languages and database query languages. D. H. D. Warren [1982b] provides interesting insights into the focus on Prolog for the FGCS. The use of Prolog to express queries dates from around the same time. For example, the Chat-80 system [Warren and Pereira 1982] analyzed natural-language questions and turned them into Prolog clauses that could be evaluated against stored predicates. (All examples there are actually in Datalog.) Warren [1981] showed that such queries were amenable to some database-style optimizations via Prolog rewriting and annotation. Prolog itself had been proposed for database query. For example, Maier [1986b] considers Prolog as a database query language and notes its advantages— avoiding the “impedance mismatch” between DBMS and programming language, expressive power, ease of transformation—but also points out its limits in terms of data definition, update, secondary storage, concurrency control and recovery. Zaniolo [1986] also notes that Prolog can be used to write complete database applications, avoiding the impedance mismatch. He further proposes extensions to Prolog for use with a data model supporting entity identity. For a recent survey of the history of logic programming, see Kowalski [2014]. Database Systems As the relational model gained traction in the 1980s, limitations on expressiveness of the query languages became widely recognized by researchers and practitioners. In particular, fairly common applications—such as the transitive closure of a graph and bill-of-materials roll ups (i.e., aggregation of costs, weights, etc. in a partsubpart hierarchy)—could not be expressed with a single query in most relational query languages. Various approaches for enhancing expressiveness to handle such cases were proposed, such as adding control structures or a fixpoint operator to relational algebra. Datalog was a simple alternative that was similar to domain relational calculus, with which the database theory community was familiar. Thus, it was readily understood, and provided a natural setting in which to study topics such as deductive databases, recursion, its interaction with negation, and evaluation techniques. Much of the early discussion and presentations on Datalog took place at the informal“XP” workshops6 (particularly XP 4.5 in 1983 and XP 7.52 in 1986) and the early symposia on Principals of Database Systems (PODS), which were 6. A list of these workshops can be found in http://dblp.uni-trier.de/db/conf/xp/index.html


a follow-up to the XP workshops to some extent. Work on Datalog also highlighted the difference between query answering as evaluation in a model vs. deduction in a theory. Background. It was recognized early on that there were recursive queries express-

ible neither in relational algebra nor relational calculus. For example, Aho and Ullman [1979] prove the inexpressibility of transitive closure by finite relational expressions, and consider various extensions to handle it, such as a least-fixed-point operator and embedding in a host programming language. Paredaens [1978] and Bancilhon [1978] also explore this issue. The PROBE system supported traversal recursion over directed graphs [Rosenthal et al. 1986]. The connection of logic and databases predates the relational model. For example, Green and Raphael [1968] uses theorem proving as the basis for the questionanswering system QA1 that can “deduce facts that are not explicitly available in its data base.” The 1970s was an active time for investigating the connections between logic and databases, such as model-theoretic vs. proof-theoretic views of a database [Nicolas and Gallaire 1977] and “closed-world” vs. “open-world” assumptions about the information in a database [Reiter 1977b]. It was also a time when prototype deductive databases based on logic began appearing, such as MRRPS 3.0 [Minker 1977], DADM [Kellogg et al. 1977], and DEDUCE 2 [Chang 1977]. While many researchers of deductive databases focused on function-free logic, Reiter [1977a] was explicit in his opinion that function-free logic “approximates [his] own intuitive concept of what should be a database,” for otherwise “any first-order theory is a database,” such as point-set topology. For a history of deductive databases, see Minker et al. [2014]. Artificial Intelligence The use of logic and deduction as a basis for question-answering and reasoning in expert systems dates to at least the late 1960s. Rule-based systems were also a common approach to AI problem solving. Logic languages such as Prolog were attractive to this community because they encompassed both basic information and rule-based “intelligence” to work with that knowledge in a uniform model, while providing a formal foundation for rule-based reasoning. There were other attractions, such as the natural use of meta-programming features to manipulate programs and implement alternative evaluation systems. Prolog rules seemed an accessible means for domain experts (who were assumed not to be sophisticated programmers) to directly capture their reasoning strategies. Furthermore, the resolution-based deduction methods used with Prolog were at once a close analog


of human reasoning and an efficient evaluation mechanism. By the early 1980s, there was interest in working with large fact bases in expert systems, and imbuing database systems with intelligence, manifested in the closely related areas of Knowledge-Based Systems (KBS) and Expert Database Systems (EDS). Some saw the function-free subset of Prolog as a happy medium between databases and general rule-based reasoning. Datalog became a common basis for work in the KBS and especially the EDS communities. Background. As mentioned, one of the earliest proposals to use theorem proving

as a mechanism for query answering was that of Green [1969]. The 1970s saw a proliferation of expert systems that tried to replicate human expertise in computational form [Puppe 1993]. These early systems tended to be ad hoc, with much of their knowledge encoded procedurally. Toward 1980, KBS emerged as an architectural approach to make expert systems (and other reasoning applications) easier to construct and maintain, especially when involving large collections of information [Davis 1986]. In a KBS, there is a separation of knowledge structures and the computational mechanism to apply that knowledge. These two parts are often called the knowledge base and the inference engine. (The inference engine did not necessarily use logical inference. It could, for example, make use of probabilistic reasoning.) The knowledge base contains both concrete information (facts) and more abstract forms of knowledge, such as rules, templates, or classification hierarchies. Logic was the representation for the knowledge base in some KBS, although there were competing approaches, such as frame-based systems [Minsky 1975] and semantic networks [Findler 1979]. (It is interesting to note that there was some debate at the time as to the suitability of logic for this role [Hayes 1977, Hayes 1980, Winograd 1975].) The logic-based KBS often structured the knowledge base in the form of facts and rules, similar to a logic program. For example, the DLOG system [Goebel 1985] was a KBS using logic, and its implemented subset consisted of facts and Horn clauses that were passed to a Prolog-based interpreter. Another example of a logic-based framework for KBS was the Syllog system [Fellenstein et al. 1985], which had a database and Prolog-style rules, but presented a structured natural-language interface to them. In a similar vein to KBS, Expert Database Systems (EDS) were an effort to imbue database systems with richer representation and reasoning capabilities. Logic (usually in the guise of logic programming) was often the choice for providing capabilities such as classification hierarchies [Dahl 1982] and incorporating constraints into query answering [Dahl 1986, Kifer and Li 1988]. Kerschberg [1990] provides an overview of EDS.


1.2.1 Uptake

While Datalog had precursors from Logic Programming, Database Systems, and Artificial Intelligence, the bulk of the early work on it took place in the database theory and query-language communities. Deductive-database researchers showed some interest, but their focus was more on removing non-declarative features—such as cut—from Prolog while retaining top-down, SLD-resolution approaches to evaluation [Minker et al. 2014]. On the database-query side, however, there was strong interest in adding rules and recursion to existing query frameworks, which mainly employed bottom-up techniques based on relational algebra for evaluation. Since Datalog did not have function symbols, a safe program implied a finite number of IDB facts from a finite EDB, meaning that bottom-up techniques would converge. Much early work on implementation techniques for Datalog and similar languages was based on bottom-up approaches, and researchers found ways to adapt these approaches to support additional features, such as complex objects, aggregation, and negation. These extensions motivated new semantic constructs, such as stratification and stable models for negation and other non-monotone extensions. Therefore, it should be no surprise that the majority of sources cited in this chapter appeared in database venues, though there is a significant body of references from logic programming and AI publications.

1.3

Coining “Datalog” To the best of our knowledge, the name “Datalog” as applied to function-free Prolog was coined while two of the authors of this chapter, David Maier and David S. Warren, were working on the book Computing with Logic [Maier and Warren 1988]. The book used simplified logics and languages to introduce concepts such as substitution and resolution. The simplest logic was propositional Horn clauses, and the obvious name for the corresponding language was “Proplog.” We wanted to introduce unification and specialization of clauses using predicate Horn logic without function symbols, to avoid some of more complicated bits of Prolog initially, such as the occurs check and structure sharing. Maier and Warren kicked around ideas for a name for the corresponding language, but came up with nothing exciting at first. The next day, Maier came back with “Datalog,” presumably because predicates without function symbols looked a lot like database relations. At the time, several researchers had already proposed function-free, definiteclause logic as a basis for deductive databases. While Maier and Warren were not thinking about “Datalog” as the name for a database language, it quickly


spread to that use, and the first appearances in print of “Datalog” depict it as a database language. Computing with Logic only came out in January 1988, but Maier introduced the name to students at Oregon Graduate Institute (then Oregon Graduate Center) and to collaborators at Stanford University while he was working on the book. The first uses of the name Datalog in the general literature, as far was we can tell, were in the proceedings of the PODS conference in March 1986, where it appears in Afrati et al. [1986] and Bancilhon et al. [1986]. The first mention of Datalog in the gray literature was by Harry Porter, who had taken a class from Maier on the magic of Logic Programming based on the notes for Computing with Logic. The earliest use we have found is a manuscript by Porter from October 1985 [Porter 1985]. We note that the name “Datalog” was also applied about the same time to a natural-language system for database query (written in Lisp) [Hafner and Godden 1985]. That use came from “database dialog,” and does not seem to have a connection to Prolog. In any case, by mid-1986, Datalog as the name for a database language had been established. Viewed from the programming language side, it was a restricted version of Prolog, with no function symbols and no extra-logical predicates, such as cut and var. From a logical perspective, it was based on definite Horn clauses over predicates. (A definite Horn clause is a disjunction of literals with exactly one positive literal.) As a database query language, it resembled domain relational calculus. There is some variation across authors about whether or not “pure Datalog” includes negation; here we will assume not. We do assume Datalog programs are safe; safety conditions need to be modified appropriately for some of the extensions below.

1.4

Extensions to Datalog While pure Datalog has a simple syntax and a clean theory, it does lack direct support for features commonly found in database languages: negation (and set difference), arithmetic, sets and multi-sets, aggregation, updates, and typing and constraints. (There are also system capabilities such as concurrency and recovery that are orthogonal to expressiveness. We do not cover them in this chapter.) There have also been extensions to add object-oriented features (such as object identity, type hierarchies, and nested structures) and higher-order capabilities (such as variables over predicates). We treat negation in some detail, as it has a large effect on semantics and evaluation techniques, and touch on other extensions in varying degrees of detail.


1.4.1 Negation

Pure Datalog cannot express all queries that the relational algebra can because, in particular, it lacks an analog for the difference operator (which requires negation in logic). For example, to find pairs of people who do not share a common advisor, we would want to use a negation operator and write

unrelated(P1, P2) :-
    person(P1, _, _), person(P2, _, _), not related(P1, P2).

or something similar.7 One approach would be to extend Datalog to use a form of logic more general than definite clauses, and use an evaluation approach based on more general theorem-proving methods. However, such a generalization brings with it much higher computational costs, as the simple proof scheme (called linear resolution) is no longer complete for more general logics. In fact, this “classical” approach to negation does not easily extend to how negation is used in databases. Prolog makes use of the notion of “negation as finite failure,” whereby the negation of a goal is established if attempting to establish the goal fails. This approach is a form of default negation: if something cannot be determined to be true, it is assumed by default to be false. Traditional databases can be seen as using such a form of default negation, if we assume a relation is understood as a set of true facts. If a database query asking whether “Michael” is an employee finds no employee fact for “Michael,” then it answers “No,” that is, “Michael is not an employee.” It does not answer “I don’t know,” which would be an answer according to the classical logic, since neither “Michael is an employee” nor “Michael is not an employee” is logically implied by the set of (positive) employee facts in this case. The meaning of default negation is straightforward in traditional databases, but becomes problematic when queries are recursive. The simplest definition that raises questions has the form: shavesBarber :- not shavesBarber.

Interpreted as a rule to define shavesBarber in some “real world,” this seems to be saying that if shavesBarber is false in our world, then the body of the rule is true, and so it implies that shavesBarber is true. If, on the other hand, shavesBarber is true in our world then the body of the rule is false, and so there is no way to show 7. The use of negation requires an additional syntactic condition to ensure safety, namely that every variable in a rule appear in at least one non-negated goal in the rule body. Thus, for example, a rule unrelated (P1, P2) :- not related (P1, P2) would not be safe.


that shavesBarber is true. Having failed to establish truth of this fact, by negation as failure one might conclude that shavesBarber should thus be false. The first proposal from the Prolog community to precisely define the meaning of negation in arbitrary rule sets was by Clark [1978] and is known as Clark’s completion. The idea is to “complete” the set of rules (thought of as implications) by essentially turning them into if-and-only-if rules (after combining clauses defining the same predicate using disjunction), and then to use first-order logical implication to determine which goals are true and which are false. So under the completion, the rule above becomes shavesBarber ⇐⇒ ¬shavesBarber, which is inconsistent and has no models. So this situation can be taken to mean that shavesBarber is neither true nor false. This result seems reasonable for this particular one-rule program, but suppose we have a perfectly reasonable and consistent program and one adds the additional shaves-not-shaved rule above. Suddenly the entire database becomes inconsistent just because of one rule—even if this rule has nothing to do with the rest of the database. But problems with Clark’s completion do not end there. Consider the following positive rule: iAmHappy :- iAmHappy.

What should this program say about iAmHappy? Under the usual semantics for positive rules, iAmHappy would be false since it is not in the minimum model of this program. Alternatively one can see that this rule (being a tautology) says nothing about anything (including iAmHappy) being true, so by default reasoning iAmHappy should be false. But if we take the completion of this program, we get iAmHappy ⇐⇒ iAmHappy which is a tautology, so neither iAmHappy nor not iAmHappy is a logical consequence of this completed program. Thus, iAmHappy is not determined to be either true or false. This situation might not seem much of a problem for this program, but it has deeper consequences for other, more interesting programs. Consider a program defining the transitive closure of an edge relation: reachable(X, Y) :- edge(X, Y). reachable(X, Y) :- reachable(X, Z), edge(Z, Y).

If the edge relation has self-loops, there will be instances of the second rule that simplify to the form reachable(a,b) :− reachable(a,b). For example, if we simply have the single edge fact, edge(b,b), and take the instance of the second rule using X = a, Y = b, and Z = b, then edge(b,b) can be simplified away leaving the self-recursive form. In order to get the right answers for pairs of nodes that are unreachable, we must treat such an instance as determining that reachable(a,b) is


definitely false, not that it is unknown. Thus, this example shows that the completion semantics gives the wrong answer for the fully positive program for transitive closure, i.e., the program above, under the completion semantics, does not define transitive closure. It is particularly unpleasant that this attempt to extend the semantics of definite programs (the least model) to programs with negations changes the meaning of definite programs. But this situation indeed is not surprising when we observe that the completion of a Datalog program is a first-order theory and that transitive closure is not first-order definable. And since, for many in the database community, the ability to define transitive closure was one of the key motivations for adding recursion to a query language, the completion semantics appeared to be a wrong semantics for Datalog programs with negation. Its problem was not even with negative recursion, but with positive recursion.

So the search commenced for a semantics of logic programs with negation that would extend the least-fixed-point semantics of positive rule sets. Nicolas and Gallaire [1977] described the alternative approaches to formalizing database semantics in logic, using a theory or an interpretation. The completion semantics used a theory. The following semantics use interpretations.

The first approach was to restrict the set of programs to those that were not problematic. If a program does not involve recursion through negation, there is no problem. When a program can be stratified [Apt et al. 1988, Van Gelder 1989] in such a way that whenever a first goal depends negatively on a second goal, the second goal is in a strictly lower stratum than the first, then we can compute the goals in order from lower strata to higher strata. For example, consider the program:

p :- not q.
q :- r, s.
s :- q.
s :- not t.

We can assign a stratum to each proposition, as, for example, t:1, r:2, s:3, q:3, and p:4. Then every proposition depends positively on other propositions at its stratum or lower, and depends negatively on propositions only at strictly lower strata. So p depends negatively on q and q is at a lower stratum than p. Propositions q and s depend on each other positively and have the same stratum. For such stratified programs, we can compute the meanings of propositions by starting with the propositions of the lowest stratum and then moving to the next higher stratum. Thus, when the negation of a proposition is needed, it has already been completely evaluated at a previous stratum. This stratum-by-stratum fixed point, which is called the perfect model of the program [Przymusinski 1988b], defines the semantics


for stratified programs. A key point in this definition is how to precisely define when a goal depends on another goal. Initially the notion of predicate stratification was defined that used the static predicate call graph to determine potential goal dependencies. The introductory example, along with the rule for unrelated above, is predicate-stratified, with person, related, and advised in the first stratum and unrelated in the second stratum. However, many meaningful, useful programs were not predicate-stratified, but still did not have real recursion through negation and could be given reasonable meanings. So a second, more refined, definition of stratification, called local stratification [Przymusinski 1988a], was proposed, but the same situation occurred. Thus, dynamically stratified [Przymusinski 1989] and yet more definitions were proposed. It was generally agreed that all forms of stratification gave appropriate semantics for the programs to which they applied, but a semantics that worked for all programs with negation was desired. Two general solutions followed: the Well-Founded Semantics [Van Gelder et al. 1991] and the Stable-Model Semantics [Gelfond and Lifschitz 1988]. There have been (many) others, but these two have had the most acceptance and influence. We will discuss them in turn. For a detailed treatment of this topic, see Chapter 2. The Well-Founded Semantics (WFS) uses a three-valued logic, in which goals can be true, false, or undefined. The WFS can be defined by an iterated fixed point, which at each iteration computes facts that must be true and facts that must be false. When the computation converges, facts that are neither determined true nor false are undefined. In this manner, a three-valued well-founded model can be constructed for every program. If the program is stratified (for any of the forms of stratification), the well-founded model will be two valued and agree with the perfect model. For the program teaches_db(maier) :- not teaches_db(maier).

the fact teaches_db(maier) is undefined in the WFS. And for the program

someone_teaches_db :- teaches_db(maier).
someone_teaches_db :- teaches_db(warren).
teaches_db(maier)  :- not teaches_db(warren).
teaches_db(warren) :- not teaches_db(maier).

all facts are undefined in the WFS. The well-founded model is polynomially computable in the size of the ground program, and has polynomial data complexity for Datalog programs. The WFS satisfies the “relevancy” property, in that the truth value of a goal depends only on goals it depends on in the goal-dependency graph.


Thus, there is a goal-directed, top-down evaluation strategy for the well-founded semantics that is worst-case quadratic in the size of the ground program [Chen and Warren 1996]. We discuss the most popular such strategy in Section 1.5. The other definition is the Stable-Model Semantics (SMS). Stable models are two-valued, and a Datalog program may have multiple stable models, or it may have none. A stable model is defined as a model that has the property that, if all negative goals in the (ground) Datalog program are interpreted as true or false according to the model and the program is then simplified accordingly, then the minimum model of the resulting positive program is that original model. For instance, the program teaches_db(maier) :- not teaches_db(maier).

has no stable models, but the program

someone_teaches_db :- teaches_db(maier).
someone_teaches_db :- teaches_db(warren).
teaches_db(maier)  :- not teaches_db(warren).
teaches_db(warren) :- not teaches_db(maier).

has the following two stable models:

{someone_teaches_db, teaches_db(maier)}
{someone_teaches_db, teaches_db(warren)}
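To see the definition at work, take the first of these models and simplify the ground program accordingly, as a worked sketch of the check just described: the rule for teaches_db(warren) is deleted because its negative goal not teaches_db(maier) is false in that model, while the negative goal of the rule for teaches_db(maier) is true in the model and so is simply dropped. What remains is the positive program

someone_teaches_db :- teaches_db(maier).
someone_teaches_db :- teaches_db(warren).
teaches_db(maier).

whose minimum model is exactly {someone_teaches_db, teaches_db(maier)}, the model we started with, so that model is stable. The symmetric argument applies to the second model.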

Finding a stable model is in general NP-hard in the size of the (ground) program. Also, there is no obvious goal-directed way to compute a stable model; the entire program may have to be processed, since an inconsistency (e.g., a rule p :not p) anywhere in the program means the entire program has no stable model. A rule such as p :- q, not p. (where p appears nowhere else in the program) implies that q must be false in any stable model, regardless of how q might be defined elsewhere in the program. The AI community was most interested in the SMS, which was defined by Gelfond and Lifschitz [1988], who came from theoretical AI. The general idea of assuming false what is not known to be true is known as the “closed world assumption” [Reiter 1977b]. It can be interpreted as a kind of common-sense reasoning and has close ties to nonmonotonic logic formalisms that had been studied previously in the AI community [Bobrow 1980, Brewka et al. 1997] by John McCarthy, Drew McDermott, Jon Doyle, Ray Reiter, and Robert Moore, among many others. Since many AI formalisms inherently have high complexity, the fact that SMS was


NP-hard was not seen as a particular problem. Datalog under the SMS led to the exploration of some traditional AI problems, such as planning and reasoning about change and actions.

The database community was more interested in WFS, which was defined by Van Gelder et al. [1991], who were researchers in the theoretical database community. Its computational properties conformed much more to traditional database requirements. Some have constrained their interest to Datalog programs that have two-valued well-founded models, which can be considered as the strongest definition possible for stratified programs.

The logic programming community has shown interest in both definitions, but probably more in the SMS. SMS has spawned a large subcommunity that studies Answer-Set Programming (ASP) [Marek and Truszczyński 1999]. The idea is to use Datalog under SMS as a "programming language" to specify sets of propositions, i.e., those determined as true in some stable model of the program. Many combinatorial problems can be naturally specified by such Datalog programs. ASP also allows disjunction in the rule heads—an extension inspired by the earlier work on disjunctive logic programming [Minker 1994, Minker and Seipel 2002]. It is shown in Eiter et al. [1997] that this extension adds both expressive power and complexity to the language. Much research has gone into developing efficient solvers that can find one or all of the stable models of a Datalog program. ASP is closely related to SAT solving, i.e., finding truth assignments that satisfy a set of propositional formulas. The difference is that an ASP solver will find only "minimal" satisfying truth assignments,8 and for this reason they may be better suited to problems where minimality is important, such as planning. Planning is often formulated as reachability in a graph whose nodes are states and whose edges are possible actions, and formulating reachability using logic rules requires minimality.

8. Minimality in this context means there is no other satisfying truth assignment that makes a proper subset of the propositions true. Note that there can be multiple, incomparable such assignments.

There have been proposals for integrating the desirable aspects of WFS and SMS into a single framework. Of particular interest is FO(ID) [Denecker et al. 2001, Vennekens et al. 2010] and, more recently, the founded semantics [Liu and Stoller 2018]. This line of work may eventually provide the basis for a broadly accepted integrated semantics for Datalog.
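Returning to the ASP use of stable models described above, the following small program (invented here for illustration; node and edge are EDB relations) has one stable model for every independent set of a graph. The first two rules use unstratified negation to choose, for each node, whether it is in or out of the set, and the last rule uses the p :- q, not p pattern mentioned earlier to eliminate any model that selects two adjacent nodes:

in(X)  :- node(X), not out(X).
out(X) :- node(X), not in(X).
bad    :- edge(X, Y), in(X), in(Y), not bad.

An ASP solver that enumerates the stable models of this program thus enumerates exactly the independent sets.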

1.4.2 Arithmetic and Other Evaluable Predicates

As presented so far, Datalog is limited to symbolic computations. Database queries often need arithmetic operators and comparison, and functions on other scalar


types (such as string matching and concatenation). Such functions can be viewed as predefined predicates (with large, possibly infinite, extents). For example, addition can be modeled as a three-place predicate sum(M,N,P) that contains all facts where M + N = P, such as sum(2,3,5) and sum(5,0,5). The greater-than comparison can be modeled as a predefined two-place predicate gt(M,N) containing all pairs of numbers that stand in the correct relationship, such as gt(5,1) and gt(6,3). Obviously, the extents of such predicates cannot be stored explicitly (unless the domain is small, such as for Boolean operators). Thus, such predicates are actually supported by evaluation in the underlying processor. Because of this computational realization, the use of evaluable predicates is restricted with respect to which argument positions must be bound before evaluation. For example, sum might require that the first two or all three arguments be bound; gt requires both arguments to be bound. The conditions for safe Datalog rules can be extended to incorporate these binding requirements. Maier and Warren [1981] give a general method of checking safety in a query by modeling binding patterns with functional dependencies.

Most logic languages provide alternative syntax for evaluable predicates. For example, the rule

shortcut(Loc1, Loc2, D) :-
    dist(Loc1, Loc2, D1), dist(Loc1, Loc3, D2), dist(Loc3, Loc2, D3),
    sum(D2, D3, D), gt(D1, D).

can be written in Prolog as

shortcut(Loc1, Loc2, D) :-
    dist(Loc1, Loc2, D1), dist(Loc1, Loc3, D2), dist(Loc3, Loc2, D3),
    D is D2 + D3, D1 > D.
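Because evaluable predicates are realized operationally, goal order matters when such rules are run directly in Prolog: an arithmetic goal raises an instantiation error in most Prolog systems if its operands are not yet bound. The following variant (shortcutWrong is an invented name; this is a hypothetical rewrite of the rule above, not a recommended one) violates the binding requirement because D2 and D3 are still unbound when the is goal is reached:

shortcutWrong(Loc1, Loc2, D) :-
    D is D2 + D3,                          % error: D2 and D3 are not yet bound here
    dist(Loc1, Loc3, D2), dist(Loc3, Loc2, D3),
    dist(Loc1, Loc2, D1), D1 > D.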

1.4.3 Aggregation and Sets

Aggregates (such as sum, count, and average) are important operations for database query, and so should be supported in a data language such as Datalog. But just as the inclusion of recursion complicates the semantics of negation, it also complicates the semantics of aggregation. With any reasonably powerful aggregation capability, we can define negation in terms of aggregation. For example, the COUNT of solutions to a goal is 0 if and only if the goal has no solutions. So, just as definitions that involve recursion through negation can be problematic, so can definitions that involve recursion through aggregation. There have been a number of approaches to providing a semantics of aggregation operators in Datalog.


Prolog has basically one aggregation operator, findall/3 (or variants setof/3 and bagof/3),9 which collects the answers to a query into a list. With this operation, the programmer can then explicitly program many other aggregation operators by iterating over that list of collected results. But that approach requires complex data structures (the list) which are excluded from basic Datalog. LDL [Tsur and Zaniolo 1986] added a set data structure to Datalog, in which a program variable could take on the value of an extensional set. Aggregates were then defined over these extensional sets in a way not unlike in Prolog. In Prolog the evaluation of some definitions would go into infinite loops; in LDL some specifications were “undefined” or “unsafe”. Fully supporting extensional sets in a logic language, including general set unification, is quite complex, and was abandoned in the move to LDL++ [Zaniolo et al. 1993]. Another approach to supporting sets, an intensional approach in which sets have names, is discussed below in Section 1.4.7 on higherorder extensions. XSB [Swift and Warren 2011] supports aggregation by allowing the user to provide a lattice operation or a partial order relation defined in Prolog, and then applying the operator incrementally as values are added to goal tables to retain the set of minimal covering values. This capability has found interesting application in implementing conditional preference theories [Cornelio et al. 2015]. But XSB provides no formal definition of when aggregations are well defined, and the user is responsible for avoiding problematic definitions. A first approach to ensuring the well-definedness of recursive rules that include aggregation was to require that definitions involving aggregation be stratified. In this case, the computation can proceed bottom-up from lower to higher strata in such a way that any values needed for an aggregate operator are defined at a lower stratum from the aggregate, and can therefore be completely evaluated before they are needed. Predicate stratification of a program is easily decided at compile time, but it turns out that many reasonable uses of aggregation are not predicate stratified. But local stratification is more complex and essentially requires full computation to determine whether a Datalog program is locally stratified. So efforts have been made to define efficiently recognizable classes of locally stratified programs that cover all (or most of) the meaningful aggregate definitions that involve recursion, for example LDL++ [Zaniolo et al. 1993]. Another approach is to define the meaning of aggregates for all programs using the Stable-Model Semantics or the Well-Founded Semantics [Kemp and Stuckey 1991, Pelov et al. 2007]. These approaches define powerful languages for aggregates



and give them semantics under both frameworks, but some program restrictions seem necessary to ensure efficient computation of correct answers.

9. The notation pred/N indicates that the predicate pred takes N arguments.

An interesting example of a meaningful definition using non-predicate-stratified aggregation is for shortest path. In XSB, it would be written as:

minimum(A, B, C) :- C is min(A, B).

:- table sp(_, _, lattice(minimum/3)).

sp(X, Y, 1) :- edge(X, Y).
sp(X, Z, D) :- sp(X, Y, D1), edge(Y, Z), D is D1 + 1.

edge(a, b).   edge(b, c).
edge(a, c).   edge(c, d).

The table declaration indicates the use of the minimum lattice operation to aggregate the third argument of the sp/3 predicate with grouping on the first two arguments. When a query :- sp(X, Y, D). is invoked, computation proceeds as usual, except when a new answer, say sp(a, b, 3), is computed for the predicate sp/3, the table is checked to see if the new answer for the aggregate argument, here 3, is smaller than an existing answer, say 1 for sp(a, b, 1). If it is smaller (or if there is no previous answer for X and Y) the new answer is added (and any old one is deleted). If it is not smaller than the existing answer, the new answer is not added and computation fails. We can understand the semantics of this program by assuming that the aggregate table declaration asserts, in this case, the axiom ∀x , y , d , d ′ (sp(x , y , d) ∧ d ′ > d ⇒ sp(x , y , d ′)) and so sp(X, Y, D) is true if there is some path from X to Y of length D or less. Operationally we do not add a new answer to the table if it is implied by an existing answer. And when we add a new answer, we delete any old one implied by the new one. We also mention here approaches to aggregation that involve introducing new structures or language constructs, such as lattices [Conway et al. 2012, Ross and Sagiv 1992], keys on results [Zaniolo 2002], and non-deterministic choice [Zaniolo and Wang 1999].
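As a concrete instance of the findall/3 approach mentioned earlier in this subsection, a count aggregate can be programmed explicitly in Prolog by collecting the answers into a list and taking its length; count_advisees/2 is an invented name, and advised/2 is the advisor relation used earlier in the chapter:

count_advisees(Prof, N) :-
    findall(Student, advised(Prof, Student), Students),
    length(Students, N).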

1.4.4 Existential Variables in Rule Heads: Datalog±

Datalog± [Calì et al. 2011, Gottlob et al. 2014] extends the basic Datalog by allowing existential variables in the rule heads. That is, rules can now have the form

∃Z head(X, Z) :- body(X, Y).


Here, the head of the rule has existentially quantified variables (Z), while the variables X (common to the head and the body) and Y (exclusive to the body) are universally quantified outside of the rule. These latter quantifiers are omitted, following the usual convention in logic programming. Another novelty is that the head can have the form X1 = X2 (possibly quantified with ∃), meaning that the inferred instantiations for X1 and X2 must be the same constant. Here is an example: 1 2 3 4 5 6 7 8

human(mary). human(bob). % facts ∃P parent(H,P) :- human(H). % TGDs with existentials ∃M mother(H,M) :- human(H). human(H) :- parent(H,P). % TGDs without existentials human(P) :- parent(H,P). human(H) :- mother(H,M). human(M) :- mother(H,M). M1 = M2 :- mother(H,M1), mother(H,M2). % an EGD

Rules 2 and 3 here have existentials in the head and rule 8 has equality in the head. In the database theory (from which Datalog± came), the rules with head-existentials as well as the regular Datalog rules (such as rules 4–7) are called tuple-generating dependencies (TGDs), while the rules with head-equality, such as rule 8, are known as equality-generating dependencies (EGDs) [Beeri and Vardi 1981].10 What is the meaning of such rules? As with regular Datalog, we can start by deriving more facts bottom-up, as explained in Section 1.1. The first bottom-up application of the rules above would derive these facts: parent(mary, o1). parent(bob, o2). mother(mary, o3). mother(bob, o4).

The little twist here with respect to the ordinary bottom-up strategy is that rules 2 and 3 have existential variables in the head and they must be bound to completely new constants that do not appear elsewhere in the database or in the rules. In the above, we denoted these constants with o1–o4. In databases, such new constants are usually called "nulls" and in logic "Skolem constants." We assume that each head-existential gives us a unique new constant, since there is no reason to think that any of these constants might refer to the same entity. (There are also good theoretical reasons behind this choice, of course—see, for example, the work of Calì et al. [2013] and Deutsch et al. [2008].)

10. Beeri and Vardi [1981] viewed the above rules as constraints—or dependencies—and used a different, but equivalent, tableaux representation for them.

We will further derive that each of these new entities is a human and therefore has an unknown parent and a mother:


parent(o1, o1p).   parent(o2, o2p).
mother(o1, o1m).   mother(o2, o2m).
parent(o3, o3p).   parent(o4, o4p).
mother(o3, o3m).   mother(o4, o4m).

and so on. The resulting set of facts can (and, in this example, will) be infinite because this process will keep inventing new nulls. What happens if we use mary and bob in rules 2 and 3 again? In case of rule 2, we will derive a pair of new entities, o1’ and o2’, as the second parents for mary and bob (and then the third, the fourth, and so on). There is no reason to assume that these new constants refer to the same entities as o1 and o2. Things are different in case of rule 3, however. Here we derive entities o3’ and o4’, but then the EGD rule 8 says that o3’ should be the same as o3 (since both are mothers of mary, and EGD rule 8 says that each human has only one mother) and o4’ must be the same as o4 (for a similar reason: bob can have only one mother). As a result, o3’ and o4’ are renamed into o3 and o4, respectively, which turns the newly derived facts mother(mary, o3’) and mother(bob, o4’) into the “old” facts mother(mary, o3) and mother(bob, o4), making the new facts duplicate and redundant. The set of facts thus computed satisfies all the rules and is known as a universal model of these rules. It has a homomorphism into any other model of these rules and all universal models are isomorphic to each other [Deutsch et al. 2008, Maier et al. 1979]. (They are the same up to a renaming of the null constants.) Query answering in a database defined by these rules and facts is taken to mean query answering in the universal model. Usually, only the answers that do not contain null values are considered meaningful. The bottom-up process just described is known as the chase [Deutsch et al. 2008, Maier et al. 1979] because it “chases” after a database instance that satisfies the given rules. It is also sometimes called the oblivious chase, because it keeps rederiving new existentials, as in our case of the parents of mary and bob, and does not try to remember which existentials were used previously. The chase was originally proposed as a proof procedure for deriving implications of data dependencies [Maier et al. 1979], but later found other applications, including query containment and optimization [Johnson and Klug 1984] and data exchange [Deutsch et al. 2008]. There is also a restricted chase, which does not rederive existential constants such as o3’ and o4’ in the example above. The restricted chase is not guaranteed to compute the entire universal model. However, sometimes it is enough to compute a subset of that model, via the restricted chase, in order to have a decision procedure for query answering. These decidable cases are obtained via various syntactic restrictions on the appearance of variables in the rules [Calautti et al. 2015,


Calì et al. 2013, Gottlob et al. 2013]. The key idea (which dates back to Johnson and Klug [1984]) is that, although the chase may be infinite, various restrictions may reduce the problem of query answering and containment over the infinite chase to a similar problem over a bounded finite prefix of that chase.

Note that although the chase, TGDs, and EGDs have been around since the late 1970s, they were used mostly for query processing and not as a language for knowledge representation. Datalog± as a language was introduced only in 2011 [Calì et al. 2011]. Decidable cases of Datalog± are implemented in a number of prototypes, such as Alaska [da Silva et al. 2012], DLV∃ [Alviano et al. 2012], and IRIS±.11 A partial implementation of Datalog± is provided in the Ergo knowledge-representation and reasoning system [Coherent Knowledge LLC 2017]. It does not rely on the chase to ensure termination and does not take advantage of the aforementioned decidable cases, however. Instead, it uses a form of bounded rationality called radial restraint [Grosof and Swift 2013]. Ergo's semantics for head-existentials is also slightly different and corresponds to the aforementioned restricted chase. In fact, the restricted chase is equivalent to replacing head-existentials with Skolem functions and, in that sense, any Datalog system that permits function symbols in the rule heads, such as XSB [Swift and Warren 2011] or LogicBlox [Aref et al. 2015], can be said to partially implement Datalog±.

11. http://bitbucket.org/giorsi/nyaya
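Under that Skolemized reading, for instance, rules 2 and 3 of the earlier example turn into ordinary rules with function symbols in their heads; f_parent and f_mother are invented Skolem-function names:

parent(H, f_parent(H)) :- human(H).
mother(H, f_mother(H)) :- human(H).

Because the invented individual is now a term determined by H, re-deriving it produces the same term rather than a fresh null, which is why this reading corresponds to the restricted chase rather than the oblivious one.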

1.4.5 Typing and Constraints

Most computer languages have some notion of typing or constraints (or both). In databases, these aspects both manifest in the database schema, which supports the declaration of domains (datatypes) for attributes, primary keys (uniqueness constraints) on single tables, and foreign keys (referential integrity) between pairs of tables. For example, for the tables

person(ID, First, Last)
thesis(PID, Univ, Title, Year, Area)

the database schema might declare that Univ is a character string, Year is a 4-digit number, ID is a key for person, and every PID value in the thesis table should have a matching ID value in the table person, among other things. Most programming languages will associate types with program variables—implicitly or explicitly—and use those types to determine that expressions are well formed. For instance, given that the type of variable Title is a character string, then a compiler can determine that the expression


concat('Thesis: ', Title)

is well formed, whereas

Title + 6

is not. One might say that a database schema protects data from “wrong code,” while programming-language types protect code from “wrong data.” Constraints and typing information can also improve execution times, for example, by simplifying queries or limiting the bindings that need to be considered for a given variable. Datalog as a language has neither constraints nor types. However, we can consider integrity constraints as an alternative kind of axiom in a deductive database. So far, axioms have been interpreted as deductive rules: the EDB should be augmented so that resulting EDB plus IDB satisfies the axioms. But an axiom can also serve as an integrity constraint—it prohibits models where the axiom is not satisfied [Nicolas and Yazdanian 1977]. For example, if we want to require that the first argument of the area predicate is a unique key, we could use the axiom equal(L1, L2) :- area(S, L1), area(S, L2).

This constraint applies to an EDB table, but there can be constraints on IDB predicates as well. For example, if we wanted to prohibit cycles in the advised predicate, we can define the ancestor predicate as

ancestor(AID, PID) :- advised(AID, PID).
ancestor(AID, PID) :- advised(AID, PID2), ancestor(PID2, PID).

and then say an advisee cannot be an ancestor of his or her advisor. not ancestor(PID,AID) :- advised(AID,PID).

To distinguish axioms that act as integrity constraints from those that act as derivation rules, the former are often written in the form of a denial [Kowalski et al. 1987], which can be viewed as a query that should have an empty answer.12 In this denial form, our two constraints above become

:- area(S, L1), area(S, L2), not equal(L1, L2).
:- advised(AID, PID), ancestor(PID, AID).

12. Another convention is to use a right-pointing arrow for rules that should be interpreted as integrity constraints [Aref et al. 2015].


Some authors put an explicit false proposition to the left of the arrow in denial form, to emphasize that satisfaction of the body should be regarded as inconsistency: false :- area(S, L1), area(S, L2), not equal(L1, L2).

Typing information about predicates can also be represented in the same form as integrity constraints. For example, if isPersonID is a unary EDB predicate that contains all IDs of persons (viewing a type as a set of values), then we might require all participants in the ancestor predicate to be people:

isPersonID(AID) :- ancestor(AID, PID).
isPersonID(PID) :- ancestor(AID, PID).

or, in denial form,

:- ancestor(AID, PID), not isPersonID(AID).
:- ancestor(AID, PID), not isPersonID(PID).

Denial form for an integrity constraint suggests a direct way of enforcing it: simply execute it as a query and see if it succeeds. In that case, the constraint is violated and the answers to the query are the witnesses of that violation. However, the prospect of enforcing integrity constraints in this manner is not attractive if it has to be done frequently (such as after each database update). It potentially involves a scan of the whole database, even if only a small part changes. That observation has led to approaches that target constraint maintenance rather than constraint enforcement. Constraint maintenance assumes that before an update, all constraints are satisfied, so that any violations after the update must be the consequence of the values that were changed [Lloyd et al. 1987]. For example, assume the key constraint on area currently holds, and an update inserts the fact area(ss, 'Systems Science'). Then any violation of

:- area(S, L1), area(S, L2), not equal(L1, L2).

must involve that new fact. The integrity constraint can be "reduced" [Martinenghi et al. 2006] by matching the new fact with the first subgoal, leaving

:- area(ss, L2), not equal('Systems Science', L2).

This clause is much quicker to check, especially if there is an index on area. Similarly, for the integrity constraint on ancestor, if we inserted advised(kr, jf), then we would only need to check whether


:- ancestor(jf, kr).

can be established or not. There has been a range of constraint-maintenance approaches for logic languages, including forward-chaining updates to try to establish denials [Kowalski et al. 1987] and translating integrity constraints to SQL triggers on the EDB relations [Decker 1986]. Other methods are discussed in the survey by Martinenghi et al. [2006]. Constraint maintenance can still involve significant computation, especially for constraints on IDB predicates. For certain constraints, it may be possible to show statically that they must hold if other constraints are enforced. A particular situation where this approach is valuable (and generally feasible) is to show that typing constraints on IDB predicates hold if such constraints on EDB predicates are enforced. For example, the typing constraint

person(AID) :- advised(AID, PID).

on the EDB relation advised can be used to establish the typing constraint person(AID) :- ancestor(AID, PID).

essentially by proving a theorem on the IDB predicate ancestor. Richer type systems, including features such as subtyping and mutual exclusion, can also be statically checked [Reiter 1981]. However, there is a trade-off between the speed of checking and the completeness of the method [Zook et al. 2009]. There are other classes of constraints where the consistency of the IDB can be proved from the consistency of the EDB. For example, Wang and Yuan [1992] give a method for handling downward closed constraints—constraints that are preserved if tuples are removed from a predicate, such as functional dependencies. Typing and constraints, as noted, can help prevent data corruption and programming errors, but they can also improve performance. In terms of low-level implementation, knowing the type of an argument, such as Integer, permits a specialized representation, and avoids checks for handling other possibilities. At a higher level, there is a long history of using constraints for semantic query optimization: rewriting queries into simpler or more efficient forms, based on constraints that hold. For example, Chakravarthy et al. [1990] introduce a method of residues that can variously remove redundant literals from a query or include additional ones that are more restrictive. Residues are produced by matching parts of an integrity constraint to subgoals in a rule. The unmatched part of the constraint (the


residue) is attached to the rule and used during query evaluation. For example, consider the following rule for possible dissertation committee chairs for a student: possibleChair(S, P) :enrolled(S, D), appointed(P, D), tenured(P).

which says the possible chairs for a student are the tenured professors appointed in the department where the student is enrolled. If there is an integrity constraint :- adjunct(P), tenured(P).

that says an adjunct professor cannot be tenured, then the residue of the constraint relative to the rule is { :- adjunct(P)}

This residue could be used to simplify the query :- possibleChair(lvk, P), not adjunct(P).

by removing the second literal (if there is only one rule for possibleChair). Lee and Han [1988] adapted the residue method to handle recursive queries, and Lakshmanan and Missaoui [1992] focused on semantic query optimization for inclusion and context dependencies. (A context dependency is an extension of an inclusion dependency that involves algebraic expressions and not just individual relations.) Sagiv [1987] incorporates tuple-generating dependencies (such as multi-valued and join dependencies) into testing containment of Datalog programs, which thereby extends the earlier methods for query optimization [Johnson and Klug 1984]. (See Section 1.5.5 for the role of containment in optimization.) Levy and Sagiv [1995] look at optimizing recursive programs, using integrity constraints to avoid rule applications that will necessarily have empty results, as well as augmenting rules with additional restrictions.
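As a small sketch of the constraint-maintenance idea from earlier in this subsection, the reduced check for the key constraint on area can itself be phrased as a rule that is run only against a newly inserted pair; keyViolation/2 is an invented name:

keyViolation(S, L) :- area(S, L2), not equal(L, L2).

Calling keyViolation(ss, 'Systems Science') after the insertion succeeds exactly when the new fact clashes with an existing area name for the same key, so only the facts for that key need to be examined rather than the whole relation.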

1.4.6 Object-Oriented Logic Programming

Extending Datalog and Logic Programming in general with object-oriented features has been an active area in database and logic programming research since the late 1980s, and the most influential systems were developed in the 1990s and 2000s. Two main approaches can be identified among the variety of works on this subject. The first can be called object-oriented Prolog or OOP. That approach takes Prolog, with its procedural semantics and extra-logical predicates, and adds the syntactic constructs to represent objects, classes, and so forth. The main idea is to support object-oriented programming in Prolog, rather than to develop new logical


theories for a combination of these two paradigms. Prolog++ [Moss 1994] and SICStus [Carlsson et al. 2015] took this approach in the 1990s and more recently Jinni [Tarau 2004] and Logtalk [Moura 2000, Moura et al. 2008] did so. Logtalk is by far the best known and best developed approach and system in the OOP stream of work.

In contrast, the object-logic approach, or OLOG, aims to extend the very logical foundations of logic programming to incorporate the notions of complex objects and classes not only at the level of syntax but also to make them first-class citizens in the semantics of the logic. A great number of approaches to OLOG have been proposed [Abiteboul and Grumbach 1987, Abiteboul and Kanellakis 1989, Aït-Kaci and Nasr 1986, Aït-Kaci and Podelski 1993, Beeri et al. 1991, Beeri et al. 1988, Chen and Warren 1989, Heuer and Sander 1989, Kifer and Wu 1989, Maier 1986a, McCabe 1992], where F-logic [Kifer and Lausen 1989, Kifer et al. 1995] is the best known and most developed—both in terms of theory and systems.

In OOP systems, an object is typically a container for data (facts) and rules. Some of the predicates may be exported to the outside world and can be invoked by other objects. Here is an example from Logtalk's manual, which illustrates the idea.

:- object(bird).
    :- public(mode/1).
    mode(walks).
    mode(flies).
:- end_object.

:- object(penguin, extends(bird)).
    mode(swims).
    mode(Mode) :- ^^mode(Mode), Mode \= flies.
:- end_object.

The snippet above defines two classes, bird and penguin, where the latter is a subclass of the former. Bird exports the mode-predicate, which represents the modalities of birds’ movements. In general, this class would have many more predicates to represent the various properties of birds, but here we ignore those for the sake of clarity. The second class, penguin, modifies the transportation modalities for penguins by adding the mode swims, inheriting the mode walks, and blocking the inheritance of flies. The latter is achieved via the construct ^^mode(...), which calls the predicate mode in the immediate superclass of penguin (i.e., bird in our case). In contrast to OOP approaches, OLOG approaches build on the idea of a complex object, which is mainly influenced by database practice. The theory of complex


objects underwent rapid development in the 1980s, starting with the idea of Non-First-Normal-Form and Nested Relation data models and ending with a general and elegant data model for complex objects [Abiteboul and Beeri 1995, Bancilhon and Khoshafian 1989, Beeri 1989]. RDF [Lassila and Swick 1999] and JSON13 are largely reinventions of this earlier work and of the subsequent work on semi-structured data [Abiteboul et al. 2000].

13. http://www.json.org

Extending these theories to include logic rules proved harder, however, and some doubted that this goal was even possible given that object-oriented systems are inherently "pointer-based" and therefore non-logical [Ullman 1987]. One of the first important developments was the LOGIN language by Aït-Kaci and Nasr [1986]—later extended to LIFE [Aït-Kaci and Podelski 1993]—which included elements of functional programming. LOGIN gave the basic idea of how a simple logical syntax for logic object-oriented databases could be constructed, but both LOGIN and LIFE were largely constraint languages with semantics that was at odds with database query languages. This limitation led Maier [1986a] to propose elements of a semantics that was "right for databases," and this semantics, in turn, inspired the work on O-logic by Kifer and Wu [1989], Kifer and Wu [1993]—the first coherent exposition of an OLOG approach that was completely logical and fully compatible with the logical foundations of both databases and logic programming. F-logic, which culminated this line of work, was developed shortly thereafter [Kifer and Lausen 1989, Kifer et al. 1995].

The first F-logic system was FLORID [Frohn et al. 1998], but it is no longer being maintained. The open-source Flora-2 system [Kifer 2015, Yang et al. 2003] is both well maintained and has a community of users. There also are two commercial versions: Ontobroker from Semafora, GmbH [1999] and Ergo from Coherent Knowledge LLC [2017]. The latter is based on Flora-2, but includes many additional language constructs, connectivity to the outside world, optimizations, a development environment, and knowledge-acquisition tools.

The basic language constructs in F-logic are the subclass and membership relations as well as the frame representation (whence the "F" in F-logic). For instance, bird123:Penguin means that an object with the particular Id bird123 belongs to class Penguin. The statement Penguin::Bird means that the class Penguin is a subclass of the class Bird and Bird is a superclass of Penguin. (F-logic-based systems treat upper-case symbols just like the lower-case ones—as constants;


variables are prefixed with the "?" sign instead.) Frames14 specify properties of objects, as in

Bird[|limb(wing)->2, limb(leg)->2, mode->{flies, walks}|].
Penguin::Bird.
Penguin[|mode->swims|].
Penguin[|mode->?X|] :- Bird[|mode->?X|], ?X != flies.
bird123:Penguin[name->Tweety, limb(leg)->1].
\neg ?B[mode->walks] :- ?B:Bird[limb(leg)->?N], ?N < 2.

(In F-logic, -> means "has value" and [|...|] means the value is a default one. Thus, the first statement above means that, in any object in class Bird, the default value for the attributes limb(wing) and limb(leg) is 2 and the default values for mode are flies and walks.) These default values can be overwritten. In terms of Java, these properties correspond to instance methods. Thus, the first statement above says that normally birds have two legs and wings, and their modalities of movement are flying and walking. Note that limb is a parameterized property, which can be thought of as a method that takes arguments.

14. F-logic frames can be viewed as a formalization of the concept of frames introduced by Minsky [1975].

We already discussed the second statement, which says that penguins are birds, and the third statement explicitly says that penguins swim. This explicit statement overrides the inheritance of the movement modality from Bird. However, we do not want to completely discard the information about birds' movement modalities—we just want to drop the flying modality. The fourth rule does so, stating that, in addition to swimming, any movement mode of a bird—except flying—is also a valid modality for a penguin. The fifth statement is another frame: one that provides information about a particular bird with object ID bird123. First, it says that this object is a penguin. In addition, it says that this bird is (apparently a pet) named "Tweety", and it is a one-legged creature. To indicate this situation, we use the bird123[...] construct, which provides information about a particular object instance, rather


than default information about all objects in a class.15 Tweety will inherit some of the information from Bird, some from Penguin, and some of that information will be overridden. Specifically, the mode information for birds will be overridden by the more specific Penguin information, so bird123 will get the modalities of walking and swimming, but not flying. Tweety will also inherit the property of being biwinged, but the property of having two legs is overridden by the explicit statement that this bird has only one leg. There is more to the story, however. The last rule brings in an aspect of defeasibility, which is much more flexible and powerful than inheritance overriding. It says that birds with fewer than two legs do not walk. Thus there will be two contradictory inferred facts: that Tweety both walks and does not, and the result is that each of these facts defeats the other, so none of these conflicting derived facts will hold. We refer the reader to Wan et al. [2009] and the Flora-2 manual for the details of the logical semantics for defeasible reasoning in that system. In conclusion, we note that Flora-2 and Ergo go well beyond F-logic. Besides defeasibility, they support other important aspects of Datalog discussed in this chapter, including HiLog (Section 1.4.7), Transaction Logic (Section 1.4.8), modularity, some aspects of functional programming, and more. We should also mention that F-logic frames are closely related to semi-structured data [Abiteboul et al. 2000]. This relationship was gainfully employed in FLORID [Frohn et al. 1998], which uses F-logic to query semi-structured data, and WebLog [Lakshmanan et al. 1996], which uses an F-logic-like language for querying Web data.

1.4.7 Higher-Order Extensions

Since the early days of Prolog, users felt constrained by the inability to query the meta-structure of logic programs. Since Datalog is a subset of Prolog, it inherits this limitation. For instance, we have already defined several binary predicates, such as advised, adjacent, and related. How can one find out in which relationships lvk and ks stand to each other? We could ask a series of queries

?- advised(lvk, ks).
?- adjacent(lvk, ks).
?- related(lvk, ks).

15. In F-logic, classes are also objects and they can have information about themselves, which is not inherited by instances of these classes. In Java terms, this information corresponds to static methods.


This approach is hardly satisfactory: If additional relations are defined later on, more queries need to be added. It would be nice if we could just ask a single query ?- R(lvk, ks).

(R being a variable) and get an answer related(lvk, ks) (that is, R = related). One of the first logic programmers to pay attention to this issue was D. H. D. Warren [1982a], who proposed a programming style (or, one can say, an encoding scheme) that allows treating predicates as first-class objects. The idea is, roughly, to introduce a new predicate, say binary_property/3, and then represent information about the aforesaid predicates as

binary_property(advised, lvk, ks).
binary_property(adjacent, lvk, ks).
binary_property(related, lvk, ks).

In this approach, we still cannot ask ?- R(lvk, ks), but we can write instead: ?- binary_property(R, lvk, ks).

which, although awkward, does the job. This approach is essentially an adaptation of the well-known technique from formal AI, which prescribes a single predicate, true (or holds) and "downgrades" predicates to the level of data (for example, true(advised(X,Y)) or, closer to Warren's style, true(advised,X,Y)).

The problem of awkwardness and other limitations of Warren's proposal are overcome by HiLog [Chen et al. 1989b, Chen et al. 1989a, Chen et al. 1993], which is a full-blown logic that admits variables over predicates, function symbols and more. While HiLog has a higher-order syntax, its semantics is first-order. As a result, it provides a convenient higher-order syntax and yet computationally it is not more expensive than Prolog. For instance, the aforesaid query ?- R(lvk, ks), which has a variable in the predicate position, is syntactically correct and has the expected semantics.

More interestingly, HiLog supports parameterized predicates, which can significantly simplify working with Datalog and make rules more generic. To understand this idea, suppose we have various binary relations, such as parent, edge, and direct_flight. In all of these cases, transitive closure creates a new, meaningful, and commonly used concept. For instance, by closing parent transitively, we get the concept of an ancestor—also a binary relation. Similarly, the transitive closure of edge gives us the concept of a path in a graph; the transitive closure of direct_flight is also a meaningful concept—the ability to travel between cities by air with


stops in-between. In Datalog and in Prolog, one would have to write three different rule sets to define these different closures. Here is one such set for parent:

ancestor(A, D) :- parent(A, D).
ancestor(A, D) :- parent(A, C), ancestor(C, D).

For edge, the rules would be similar except that the names of the predicates would change. To make such rules generic, it is not enough to just put variables in place of the predicates: one also needs to be able to construct new names for relations because if parent were replaced by edge, then ancestor would need to be changed to some other name that depends on the underlying predicate. Here is a solution to the problem in HiLog:

closure(Pred)(A, D) :- Pred(A, D).
closure(Pred)(A, D) :- Pred(A, C), closure(Pred)(C, D).

Here Pred is a variable that can be bound to parent, edge, or direct_flight. If, say, Pred is bound to direct_flight, then the rules above define the relation closure(direct_flight). If Pred is bound to edge then these rules define closure(edge). Details can be found in Chen et al. [1993]. HiLog is available in a number of systems. First, it is supported in XSB, but its usability there is limited because it is not integrated with XSB’s module system. HiLog is fully supported by and is extensively used in the Flora-2 system [Kifer 2015, Yang et al. 2003] and in its commercial cousin, the Ergo reasoner [Coherent Knowledge LLC 2017]. We mention another important extension in this area, Lambda Prolog [Miller and Nadathur 1986]. Like HiLog, Lambda Prolog supports second-order syntax. However, unlike HiLog, it is semantically a second-order logic. As a result, Lambda Prolog has a much greater expressive power than HiLog, but it is also more expensive computationally.
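Returning to the generic closure definition above, a brief usage sketch (the flight facts are invented for illustration) shows how the parameterized predicate is queried:

direct_flight(jfk, ord).
direct_flight(ord, sea).

?- closure(direct_flight)(jfk, City).

The expected answers are City = ord and City = sea, the cities reachable from jfk with any number of stops.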

1.4.8 Datalog and Updates

The need to endow logic-based declarative languages with an ability to change the state of the underlying database has been recognized early on, and both Prolog and SQL—the two earliest and most prominent such languages—have the facilities to do so. The trouble is that neither facility was wholly satisfactory. In Prolog, database updates are performed via assert and retract, "predicates" with side effects. In SQL one uses INSERT, DELETE, and UPDATE statements. We put the term "predicates" inside the quotes because neither assert nor retract are predicates


in that they do not state or test any kind of a truth-valued statement. Instead, they perform extra-logical operations of inserting or deleting information, and their semantics can be explained only procedurally and non-logically, hence not declaratively. In SQL, the situation is similar, but, realizing the theoretical difficulty of integrating update operators into a declarative language, the designers relegated these operators to a separate sublanguage. Then, to put everything together, they came up with a host of less than ideally designed procedural languages for so-called stored procedures, completely abandoning any pretense of declarativeness.

This state of affairs has been deemed unsatisfactory by a long list of researchers and practitioners, leading to an equally long list of proposed fixes. Few have gained traction, however. Bonner and Kifer give a comprehensive survey of many of these proposals [Bonner and Kifer 1998a]. In this section, we review some of the approaches from that earlier survey and also cover some newer proposals. In addition, we will try to classify the different approaches along several dimensions.

The approaches to updates in logic languages can be roughly classified into two broad categories: explicit state identifiers and destructive updates. The latter are further subdivided into updates in the rule heads and updates in the rule bodies. There are also approaches based on other logics, such as dynamic logic [Harel 1979] and process logic [Harel et al. 1982], which we do not cover here, but they are covered in the survey cited above.

What features should one expect or want from integrating an update capability into a logic language, beyond the basic modification of data? The following desiderata are the most important features, in our view, and they will serve as additional classification dimensions for the different proposals.

1. Declarative semantics. This requirement roughly means that one would like the integration to be as smooth as possible and be declarative. Prolog and SQL clearly fail soundly on this score.

2. Subroutines, compositionality. Every programming language worth its salt has subroutines, and programming without them is pretty much unthinkable these days. Prolog and SQL let the user define derived predicates (or views, as they are known in SQL), which act much like subroutines in procedural languages. These derived predicates can be further combined into more complex predicates—similarly to other programming languages. When it comes to updates, however, some approaches falter on this score. For instance, Prolog supports subroutines that perform updates, but SQL views cannot.


3. Reactive rules. A reactive system is one that can execute certain actions in response to internal or external events. This capability is related to our topic because these events can be viewed as updates and reactivity has been identified as an important paradigm in building large systems. Reactive rules, which usually come in the form of event-condition-action rules (or ECA rules), are a natural adaptation of this paradigm to declarative languages. In SQL, for example, this idea is realized through triggers. Prolog, on the other hand, does not have explicit support for this paradigm, but it can be simulated.

4. Reasoning about actions. If a robot picks up a block from the top of another block, will the top of that other block become clear in the next state? It seems like an obvious question, but can this statement be proven within the same logical language that is being used to specify the states of such a "blocks world" and the actions of a robot in it? Most of the approaches surveyed here do not support such reasoning.

We will now briefly survey the three aforementioned categories of approaches to updates and discuss them vis-à-vis the above desiderata.

1.4.8.1 Approaches Based on Explicit State Identifiers

The oldest and best-known approach in this category is the situation calculus [McCarthy and Hayes 1969], which is still widely used for reasoning about actions, albeit not in Datalog or other logic-programming languages. The idea is to use one designated argument of each state-dependent predicate to hold a state identifier. For instance, if the initial state is denoted s0 then the state obtained by picking up block a in state s0 and then putting that block down on top of block b would be represented as do(putdown(a, b), do(pickup(a), s0)). Thus, the state identifier reflects the history of actions that brought about that state. The effects of the various actions are specified via logical formulas. For instance, in the following example, the designated state argument is the last one:

clear(BlkB, NewState), holding(BlkA, NewState) :-
    NewState = do(pickup(BlkA), S),
    on(BlkA, BlkB, S),
    possible(pickup(BlkA), S).

possible(pickup(Blk), S) :- clear(Blk, S).

This formula says that if in state S block a is on some block X and it is possible to execute the action of picking up a, then in the new state, denoted do(pickup(a), S), block X would be clear and the robot would be holding block a.
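For a concrete instance, assume a small initial state (these facts are invented for illustration):

on(a, b, s0).
clear(a, s0).

From the rules above, possible(pickup(a), s0) holds because block a is clear, and therefore both clear(b, do(pickup(a), s0)) and holding(a, do(pickup(a), s0)) are derivable for the successor state.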


One of the issues with situation calculus and the other approaches that rely on state identifiers is the frame problem. To understand the issue, consider the example above. The rule that specifies the effects of the pickup action deals with the direct effects of that action, but it says nothing about what did not change. Indeed, suppose that in state s0 we had on(d, e, s0). Intuitively, picking up block a should not affect the fact that block d is sitting on top of block e, so we would expect that on(d, e, do(pickup(a), s0)) is true. However, there is no way to derive this fact given the rules above. The missing piece of the puzzle can be provided using frame axioms, which state that the facts that are not directly affected by an action remain true in the state resulting from that action.

There are two problems with frame axioms. The first is that the number of such axioms can be large. The original solution [Green 1969] required a quadratic number of frame axioms (predicates × actions). A more feasible solution, which required only one frame axiom, was presented by Kowalski and Sergot [1986]. It was well received by database and logic programming communities, but not by some AI researchers because that solution relied on the closed-world assumption, which goes beyond first-order logic. A purely first-order logic solution, requiring a linear number of frame axioms, was later proposed by Reiter [1991]. The other problem with frame axioms is that, as the system evolves, inference can slow down significantly. For instance, after 10,000 actions, finding out what is true in the current state might require a 10,000-element chain of inferences via the frame axioms.

With respect to the desiderata stated earlier, situation calculus scores well on the first and the fourth criteria (declarativeness and reasoning), but it does not do well when it comes to the second and third criteria.

Among the other approaches that rely on state identifiers [Chomicki 1990, Kowalski and Sergot 1986, Lausen and Ludäscher 1995, Zaniolo 1993], the event calculus [Kowalski and Sergot 1986] is the best-known one. Unlike the situation calculus, these approaches use time (continuous or discrete) as a state identifier and they rely on the closed-world assumption to specify the frame axioms, which greatly reduces the number of such axioms. Apart from that, they suffer from some of the same limitations as the situation calculus but score well on the first and the fourth desiderata. Here is an example expressed in the event calculus:

initiates(pickup(Blk), holding(Blk)).
terminates(putdown(Blk), holding(Blk)).

holds(Fact, Time2) :-
    initiates(Evnt, Fact), happens(Evnt, Time1), Time1 < Time2,
    not ∃ Evnt2, Time3 (happens(Evnt2, Time3), terminates(Evnt2, Fact),
                        Time1 =< Time3, Time3 < Time2).
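Reading these rules declaratively, a small narrative (invented for illustration) behaves as expected:

happens(pickup(a), 1).
happens(putdown(a), 5).

Here holds(holding(a), 3) follows, since pickup(a) initiates holding(a) at time 1 and no terminating event occurs between times 1 and 3, while holds(holding(a), 7) does not follow, because putdown(a) terminates holding(a) at time 5.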

42

Chapter 1 Datalog: Concepts, History, and Outlook

not ∃ Evnt2, Time3 ( happens(Evnt2, Time3), terminates(Evnt2, Fact), Time1 =< Time3 Amt, balance(Acct2,Bal2), retract(balance(Acct1,_)), retract(balance(Acct2,_)), NewBal1 is Bal1 - Amt, NewBal2 is Bal2 + Amt, assert(balance(Acct1,NewBal1)), assert(balance(Acct2,NewBal2)).

By itself, the transaction above will always work correctly and will never leave us in an incoherent state. However, combining this transaction with itself to make two transfers: ?- transfer(100,Acct1,Acct2), transfer(200,Acct1,Acct3).

may run into problems, if Acct1 has balance less than $300. Indeed, the first transfer may come through correctly, but the second will fail due to the check Bal1 > Amt in the second transfer. If the two transfers were supposed to be done together (e.g., the seller and the broker must both be paid) then we are left in an incoherent state.

1.4 Extensions to Datalog

45

A popular solution to “fixing” Prolog’s update operators is provided by Transaction Logic [Bonner and Kifer 1993, Bonner and Kifer 1994]. The intuitive idea is to require that all subroutines must be transactional (whence “Transaction Logic”). This qualification means that colorGraph and the double-transfer above must be treated as transactions, and one of the key aspects of a transaction is its atomicity. Atomicity refers to the property that a transaction either executes in its entirety or not at all. If a successful execution exists then the updates performed during the transaction execution are committed and become visible. If a successful execution is not possible, all the intermediate updates are undone and the database is left in the original state—as if nothing happened at all. In case of the double-transfer above, if Bal1 is greater than $300 then the transaction succeeds and the amounts are transferred out of Acct1. Otherwise, the first or the second transfers will fail and the partial changes made to the database are undone. In case of graph coloring, the program will keep searching for a correct assignment of colors. If, at some point, the condition not (adjacent(N, N2), colored(N2, C)) cannot be satisfied, some previous color assignments will be undone and other assignments will be tried until either a correct assignment is found or all assignments have been tried and rejected. Practically, this trial-and-error is implemented through backtracking and so such transactional updates are sometimes called backtrackable. In brief, Transaction Logic extends the classical logic with one new connective, the serial conjunction, denoted ⊗. Further extensions (such as concurrency, hypothetical updates, defeasible updates) introduced additional connectives [Bonner and Kifer 1995, Bonner and Kifer 1996, Fodor and Kifer 2011]. The Transaction Logic version of the graph-coloring example above thus looks very similarly to Prolog: colorGraph :- not uncoloredNode(_). colorGraph :- colorNode ⊗ colorGraph. colorNode :uncoloredNode(N) ⊗ color(C) ⊗ not (adjacent(N, N2) ⊗ colored(N2, C)) ⊗ tinsert(colored(N, C)). uncoloredNode(N) :- node(N) ⊗ not colored(N, _).

where tinsert is a transactional insert operator. The main difference with Prolog is, of course, the transactional semantics of the program above, which is defined by extending the standard model theory of first-order logic. Details can be found in Bonner and Kifer [1993, 1994, 1995]. In summary, the original Transaction Logic takes care of the first two desiderata on our list. A follow-up work [Bonner et al. 1993] showed that this logic is good at

46

Chapter 1 Datalog: Concepts, History, and Outlook

modeling reactive systems (specifically, event-condition-action rules), thereby fulfilling desideratum 3. The use of this logic for reasoning about actions (desideratum 4) was discussed in Bonner and Kifer [1998b] and further significant developments in this direction appeared in Rezk and Kifer [2012]. Transaction Logic also has applications to planning and Web services [Basseda and Kifer 2015b, Basseda and Kifer 2015a, Basseda et al. 2014, Roman and Kifer 2007]. Parts of this logic have been implemented in the Flora-2 system [Kifer 2015, Yang et al. 2003] and a standalone interpreter is available.17 Finally, we mention the approach in the updates-in-rule-body category used in LDL—one of the earliest Datalog systems [Naqvi and Krishnamurthy 1988, Naqvi and Tsur 1989]. Due to the problems with the non-declarative nature of Prolog’s update operators, this system restricts the syntactic forms of the rules in which update operators can appear. Although this requirement makes it possible to give a logical semantics to such rules, the restrictions themselves are severe. For instance, they prohibit recursion through updates as well as most post-conditions, which excludes both the graph-coloring and the double-transfer examples discussed earlier.

1.5

Evaluation Techniques The intended uses of Datalog required rethinking the evaluation mechanisms used for Prolog and other logic-based question-answering systems. Because of the use of definite clauses, Prolog was able to base its query evaluation mechanism on SLD resolution, using depth-first search over the SLD tree based on the rules and facts. We illustrate this top-down approach later. While Prolog-style top-down evaluation worked reasonably for many cases, particularly those where the rules and facts fit in main memory, there were several considerations that dictated developing new techniques for Datalog. All derivations vs. all answers. Prolog and other logic programming approaches generally focused on finding a solution to a problem or deciding the truth of a statement. (Are X and Y related?) Top-down approaches can be efficient in such cases, because the sequence of rule applications is directed by the current goal, where bottom-up approaches can derive a lot of extraneous facts that are not connected to the goal. Datalog was targeting database settings, where a collection of answers is expected. (Who are all the people related to Y ?) While Prolog-style top-down evaluation can generate multiple answers by backtracking to choice points, this process may be inefficient, 17. http://flora.sourceforge.net/tr-interpreter-suite.tar.gz

1.5 Evaluation Techniques

47

because there can be more than one way to derive the same answer. (In deductive terms: there can be more than one proof for some facts.) For example, if persons A and B are related because of a common ancestor C, then their relatedness can also be deduced from any ancestor D of C. Moreover, with some rule sets, there can be more than one way to derive relatedness using C. If the number of derivations greatly exceeds the number of facts, Prologlike approaches can be inefficient for finding all answers. Datalog techniques tried to avoid multiple derivations, for example, by remembering previously derived facts, and avoiding re-deriving them (in top-down approaches) or excluding them when trying to derive new facts (in bottom-up approaches). Recursion. A key advantage of Datalog over other relational languages, like SQL, was the natural expression of recursion in queries. The top-down, leftto-right evaluation style of Prolog can behave differently depending on how rules are written. Some rules can cause infinite regress in trying to prove a goal. For example, if the first rule for related were related(P1, P2) :- related(P1, C), advised(C, P2).

Prolog evaluation would keep solving the first goal with a recursive application of the same rule, and never actually encounter rules or facts for the advised predicate. Even when rule sets are better constructed, there can still be an unbounded number of search paths even when there are only finitely many answers. In fact, if the SLD tree is infinite and all answers are desired, no complete search strategy will terminate. Here the bottom-up approaches have a decided advantage, as termination is guaranteed if the set of answers is finite (though bottom up can re-derive facts). Large fact bases. Prolog implementations largely assumed that rules and facts all fit in memory. However, this assumption was questionable for applications of Datalog to large datasets. If virtual memory is required for the fact base, then accessing facts turns into disk accesses. Different binding patterns for goals lead to different orders of access, hence not all disk accesses will be clustered, leading to random reads on disk. If there is alternative storage for facts (in the file system or a database), then there can be costs associated with moving facts into internal memory structures—such as the parsing associated with the Prolog assert command. For efficient access to large fact bases on disk, data needs to be read in chunks, and manipulated in its format on disk. A great deal of the early work on Datalog (and deductive databases more generally) centered on alternative evaluation mechanisms. These approaches generally

48

Chapter 1 Datalog: Concepts, History, and Outlook

started either from basic bottom-up (BU) or top-down (TD) evaluation. The bottomup methods tried to avoid re-deriving established facts, and providing more direction as to which facts to derive. Top-down methods tried to access facts in groups (rather than one at a time), rearrange the order of sub-goals, generate database queries to solve goal lists, and memoize derived facts to break recursive cycles. Theoretical proposals for alternative evaluation methods proliferated, with further and further refinements. However, side-by-side comparisons of actual implementations of these approaches were not common (although there was some comparative analysis of methods [Bancilhon and Ramakrishnan 1986]). The potential for scaling and performance of logic languages was promoted, such as via parallel evaluation in the Japanese Fifth-Generation Project [Moto-oka and Stone 1984] or using database techniques such as indexing and query optimization.

1.5.1 Bottom-Up Methods In Section 1.1, we described the bottom-up evaluation of a Datalog program P at the level of single instantiations of rules to generate new facts. This derivation process is commonly organized into “rounds,” where in one round we consider all satisfied instances of all rules. A round can be viewed as a transformation TP , called the immediate consequence operator, that takes a set of established facts F and produces a set G of newly established facts [Van Emden and Kowalski 1976]. If we let F0 represent the initial EDB facts, then bottom-up evaluation can be seen as producing a sequence of fact-sets F0 , F1 , F2 , . . . where Fi = TP (Fi−1) ∪ Fi−1 For a Datalog program, we eventually reach a fixed point Fj where Fj = Fj −1 (that is, a point at which no new facts can be established using rules in program P ). Call this fixpoint F ∗. At this point, a query over the virtual database embodied by P can be answered. This iterative application of TP is called the na¨ ıve approach to bottom-up evaluation. There are two big problems with the na¨ıve approach: repeated derivation and irrelevant results, which give rise to more sophisticated evaluation strategies. Repeated Derivation. Note that Fi ⊇ F by construction. For a standard Datalog

program P , TP is monotone in the sense that adding more facts to the input cannot reduce the number of facts in the output. Thus TP (Fi ) ⊇ TP (Fi−1). So each round in the na¨ive approach re-establishes all the facts from the previous round, which is wasted work. The semi-na¨ ıve approach addresses this problem by only using rule instances in a round that have not been tried in previous rounds. It does so by determining which are the new facts generated in a round, then making sure any

1.5 Evaluation Techniques

49

rule instance used in the next round contains at least one of the new facts. Thus, semi-na¨ıve determines the newly established facts at round i as Di = Fi − Fi−1 and restricts TP to only generate facts if a rule instance uses at least one fact from Di . Rather than examining rule instances on a one-by-one basis, semi-na¨ıve constructs an alternative rule set Q using a version rN and rO of each predicate r, where rN contains the “new” r-facts from Di and rO the “old” facts from Fi−1. So, for example, the rule r(X, Y) :- r(X, Z), r(Z, Y).

in program P would be replaced by three rules in program Q: r(X, Y) :- rN(X, Z), rO(Z, Y). r(X, Y) :- rN(X, Z), rN(Z, Y). r(X, Y) :- rO(X, Z), rN(Z, Y).

where rN and rO are computed by the system at each iteration anew. Note that the semi-na¨ıve approach does not completely avoid repeated derivation of facts, since distinct rule instances can establish the same facts independently. For example, the rule instances r(a, b) :- r(a, c), r(c, b). r(a, b) :- r(a, d), r(d, b).

both establish r(a, b). A systematic method based on incrementalization [Liu and Stoller 2009] has been developed to ensure that each combination of facts that makes two subgoals of a rule true simultaneously is considered at most once, and furthermore to provide precise time and space complexity guarantees. The method compiles a set of Datalog rules into a stand-alone imperative program. The generated program does a minimum addition of one fact at a time, incrementally maintains the results of expensive computations after each addition, and uses a combination of indexed and linked data structures to ensure that each firing of a rule takes worst-case constant time in the size of the data. Irrelevant Derivation. The variants of bottom-up we have described so far generate

the entire virtual database before turning to determining which results match the goal. Some facts that are non-goal results are nevertheless needed in order to establish facts that are goal-facts. But there can be facts in the virtual database

50

Chapter 1 Datalog: Concepts, History, and Outlook

that neither match the goal nor are required to derive the goal facts. For example, suppose the related predicate had only the first two rules: related(P1, P2) :- advised(A, P1), advised(A, P2). related(P1, P2) :- advised(B, P1), related(B, P2).

and the goal is related(P, ks). Then any derived fact related(a, b) where b = ks will not be in the result, nor will it be useful in deriving goal facts. Aho and Ullman [1979] point out that in certain cases optimizations used for relational databases are applicable. For example, we can use “selection push-down” to move the restriction that P2 = ks into the rule bodies to get related(P1, ks) :- advised(A, P1), advised(A, ks). related(P1, ks) :- advised(B, P1), related(B, ks).

Note that this strategy works because the bound value stays in the same position. However, if our goal were related(ks, P), the binding in the goal does not directly restrict the binding in the rule body. Notice, however, that the bindings for the recursive call to related(B, P2) in the second rule are not totally unrestricted. If we think about the top-down evaluation of this rule, we see that the only useful facts for related(B, P2) are those where B is an ancestor of ks. In the genealogy database, the set of ancestors of a single person is likely a much smaller set than the set of all known people in the EDB. Thus, for example, if we computed ks-ansc(Y) :- advised(Y, ks). ks-ansc(Y) :- ks-ans(Z), advised(Y, Z).

initially, to find all the ancestors of ks, we could modify the second rule to be related-ks(P1,P2) :ks-ansc(B), advised(B,P1), related(B,P2).

With bottom-up evaluation, this rule would only establish facts where B is an ancestor of ks, effectively eliminating the derivation of many of the irrelevant facts. This idea of deriving an auxiliary predicate that mimics the propagation of topdown bindings (but which can be evaluated bottom up) is the basis of the Magic Sets approach [Bancilhon et al. 1986]. Magic Sets tries to avoid derivation of irrelevant facts by limiting bindings based on constants in the final goal. It might be seen as top-down direction of bottom-up evaluation. The full Magic Sets method is more complicated than the example above, as it has to track binding propagation through all the rules of a program, and different occurrences of the same predicate in a program may be constrained differently. In the worst case, this variation may lead to an exponential blow-up in the number of intermediate rules that Magic Sets might generate in order to track all such bindings.

1.5 Evaluation Techniques

51

Optimization of the bottom-up evaluation with respect to the goal has been studied extensively—both as generalizations of selection push-down and as variants of the Magic Sets ideas discussed above. Selection push-down is a specific instance of the general idea of propagating constants in the queries or in the rules so that only the relevant facts are derived, if possible. This strategy is analogous to partial evaluation in the programming languages literature [Jones et al. 1993]. Other evaluation methods include static filtering [Kifer and Lozinskii 1990] and dynamic filtering [Kifer and Lozinskii 1987, Kifer and Lozinskii 1988], which limit the facts being processed during evaluation, as well as rewrite methods such as specialization [Tekle et al. 2008], which rewrites the program to achieve the same effect given any evaluation strategy. As briefly described above, Magic Sets [Bancilhon et al. 1986] limits inference during bottom-up evaluation by introducing magic predicates that infer a relevant subset of values for arguments of predicates by using the binding of variables in other goals in a rule, an example of sideways information passing. In the years following the initial work on Magic Sets, many variants have been developed including supplementary Magic Sets [Sacc` a and Zaniolo 1986b], which saves space and time by introducing intermediate predicates that can be reused, generalized a and Zaniolo 1986a], which opticounting [Beeri and Ramakrishnan 1991, Sacc` mizes rules by keeping track of which rules were used to infer a particular fact, and Magic Templates [Ramakrishnan 1991], which generalize to Horn clauses with function symbols. Beeri and Ramakrishnan [1991] provide a formal discussion of these methods. One serious problem with Magic Sets and its derivatives is that the same fact may be manifested an exponential number of times in several variants of the same predicate, since Magic Sets introduces a new copy of each original predicate for every possible binding pattern for the arguments of that original predicate. For example, consider the reachability rules from Section 1.4.1 and a query with both arguments bound (such as reachable(a, b)). Magic Sets would produce the following rules and a fact, where reachable_bb is a copy of reachable created for the calls where both arguments are bound, and reachable_bf is a copy for the calls where the first argument is bound and the is second free, and so forth. reachable_bb(X,Y) :- m_bb(X,Y), edge(X,Y). reachable_bf(X,Y) :- m_bf(X), edge(X,Y). reachable_bb(X,Y) :- m_bb(X,Y), reachable(X,Z), edge(Z,Y). reachable_bf(X,Y) :- m_bf(X), reachable(X,Z), edge(Z,Y). m_bf(X) :- m_bb(X,Y). m_bf(X) :- m_bf(X). m_bb(a, b).

52

Chapter 1 Datalog: Concepts, History, and Outlook

It can be seen that the same reachability fact may be associated with both reachable_bf and reachable_bb. In general, one predicate may get blown up into as many as 2k variants, where k is the number of arguments of the predicate (for different sequences of b’s and f’s), and so the same fact may be represented an exponential number of times. In addition, note that this method can generate spurious rules that infer no facts, such as the second-to-last rule. This situation arises from hypotheses that generate no “new” magic facts, analogous to the generation of a subquery that is isomorphically equivalent to the head of a rule in top-down evaluation. More recently, a novel method, called demand transformation [Tekle and Liu 2010], has been introduced, which solves the problem of propagating top-down bindings cleanly, including avoiding the problem above and in other Magic Sets variants. Demand transformation avoids the blow-up by storing facts for the same predicate together, and using an incremental evaluation strategy with carefully selected data structures to ensure that each firing of a rule takes constant time [Liu and Stoller 2009]. It has also been shown that bottom-up evaluation of rules generated by demand transformation has the same time complexity as the top-down evaluation with variant tabling as implemented in various systems, including XSB [Swift and Warren 2011] and YAP [Costa et al. 2012]. Variant tabling reuses answers from previous queries that are identical to the current query up to variable renaming— see Section 1.5.2.1. In contrast to variant tabling, subsumptive tabling, also implemented in XSB, reuses answers from previous queries that subsume (i.e., are more general than) the current query. A method that has the same performance characteristic as the top-down evaluation with subsumptive tabling, called subsumptive demand transformation [Tekle and Liu 2011], has also been developed. What distinguishes Demand Transformation (and Subsumptive Demand Transformation) from Magic Sets and variants is not just that it is simpler, but more importantly, that it provides precise time and space complexity guarantees [Tekle and Liu 2010, Tekle and Liu 2011]. In particular, it improves over Magic Sets asymptotically, even exponentially.

1.5.2 Top-Down Evaluation We now turn to top-down evaluation. This strategy starts with a query that is represented by one or more goal literals and compares each goal against the heads of the various rules in the program. If we get a match, possibly substituting for variables, then we replace the goal with new subgoals from the rule’s body (noting that the body is empty when matching against EDB facts). If all the subgoals are eventu-

1.5 Evaluation Techniques

53

ally dispatched, then the collection of all variable substitutions used in the process can provide an answer to the original query. By considering alternative matches to subgoals, we can generate additional answers to the query. For example, consider an EDB predicate cited that indicates that the person in the first argument has cited the person in the second argument. We also define an IDB predicate influenced that is the transitive closure of cited. (A more precise definition would consider the date of citation, but we do not do it here.) cited(dsw, dm). cited(tt, dsw). cited(tjg, dm). cited(dsw, tjg). influenced(P1, P2) :- cited(P2, P1). influenced(P1, P2) :- cited(P, P1), influenced(P, P2).

Suppose we have the query goal ?- influenced(dm, X).

We can match this goal against the head of the first rule, substituting dm for P1 and X for P2. The new goal is ?- cited(X, dm).

Matching this goal against the cited-facts gives two query answers: influenced(dm, dsw). influenced(dm, tjg).

The initial goal can also be matched against the second rule, with the same substitutions, yielding the goal list ?- cited(P, dm), influenced(P, X).

The first goal here can be matched to the fact cited(dsw, dm), yielding the goal ?- influenced(dsw, X).

Matching this subgoal to the first rule, and then matching the result to the fact cited(tt, dsw) gives the additional query answer influenced(dm, tt).

There are some choices in the top-down approach, in terms of the order that goals and rules are considered. The standard for logic-programming languages is

54

Chapter 1 Datalog: Concepts, History, and Outlook

to proceed depth first, always working on the first subgoal, and trying rules in order. However, there are other strategies, such as choosing the subgoal to work on based on the number of bound or unbound variables, proceeding breadth first. 1.5.2.1

Memoing (Tabling) Approaches One big problem with top-down computation is that it can lead to an infinite recursion because a subgoal may be defined in such a way that it depends on itself. Consider the subgoal influenced(dsw, X) above. If we match it against the second (recursive) rule for influenced and match the first subgoal of that rule with the fact cited(tjg, dm), we get the subgoal ?- influenced(tjg, X).

When we use the second rule on this new goal, we can match its first subgoal in the body of that rule with cited(dsw, tjg), we get the already seen subgoal ?- influenced(dsw, X).

Thus, a na¨ıve top-down strategy would lead to an infinite loop. Resolving a subgoal repeatedly as it arises in different contexts can obviously lead to inefficiency in evaluation, not to mention infinite recursion. Unlike Prolog, where it is possible to have an unbounded sequence of distinct subgoals, Datalog queries with finite EDBs can lead only a finite number of different subgoals, up to renaming of variables. We can avoid redundant work and endless loops if we keep track of prior subgoals and their answers [Warren 1992]. We can then avoid a recursive call to a previously seen subgoal (or a more specific version of a previous subgoal). Furthermore, if we record previous answers to a given subgoal, then, if we encounter the subgoal again, we can look up the answers already found rather than computing them again. This strategy is an example of memoization, or tabling, which can be viewed as a variant of dynamic programming [Warren 1992]. Under tabled evaluation, every encountered subgoal is remembered in a table of subgoals, and every answer computed for a subgoal is remembered in the answerstable associated with this subgoal. No answer is added to the table of a subgoal more than once. Whenever a subgoal is encountered during evaluation, it is checked against the table to see if it has been previously encountered and is thus already evaluated or is in the process of being evaluated. If it is in the table, then it is not re-evaluated, but instead its answers from the table are returned as answers to the current subgoal invocation. It may happen that not all the answers for a subgoal (or even none of the answers) are in the table when it is subsequently encountered,

1.5 Evaluation Techniques

55

so the computation must be able to suspend and resume later, when answers show up in the table for this subgoal. When a goal is looked up in the table to see if it has been previously called, there are two options: one can look for a goal that is the same as the current goal up to variable renaming; this option is called variant tabling. Alternatively, one can look for a more general goal that subsumes the current goal.18 In this case, all the answers for the subsumed goal will be answers to the subsuming goal, so a subsumed goal can get its answers from the table of the subsuming goal. Using subsuming calls is known as subsumptive tabling. Our examples below assume variant tabling. Consider again the definition of the influenced relation, but with the instance of cited shown below: influenced(P1, P2) :- cited(P2, P1). influenced(P1, P2) :- cited(P, P1), influenced(P, P2). cited(dsw, dm). cited(dm, dsw). cited(tjg, dsw).

We want to pose a query to find all the colleagues that dm influences. In Prolog, this program goes into an infinite loop, repeatedly generating dsw and dm, because Prolog’s evaluation strategy tries to enumerate all the paths in the graph and there are infinitely many of them.19 Under tabled evaluation, subgoal calls and their corresponding answers are remembered and re-used. Answers are returned only once and calls are made only once, with subsequent calls using the answers returned by the initial call. Consider the following trace of tabled evaluation of the query ?- influenced(dm, COLL). 1 2 3 4

influenced(dm, COLL) newgoal: influenced(dm, COLL) cited(COLL, dm) ans: cited(dsw, dm)

18. Syntactically, G subsumes F if F can be derived from G by substituting 0 or more variables with constants or other variables. So, for example, influenced(X, Y) subsumes influenced(dm, Y), influenced(X, X), influenced(X, Y), and influenced(dm, tjg). 19. Prolog pedagogues, when teaching data recursion, wisely choose graphs, for example, the advises relation, that can have no cycles.

56

Chapter 1 Datalog: Concepts, History, and Outlook

5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38

newans: influenced(dm, dsw) (cited(_h397, dm), influenced(_h397, COLL)) cited(_h397, dm) ans: cited(dsw, dm) influenced(dsw, COLL) newgoal: influenced(dsw, COLL) cited(COLL, dsw) ans: cited(dm, dsw) newans: influenced(dsw, dm) ans: cited(tjg, dsw) newans: influenced(dsw, tjg) (cited(_h795, dsw), influenced(_h795,COLL)) cited(_h795, dsw) ans: cited(dm, dsw) influenced(dm, COLL) ans: influenced(dm, dsw) from table newans: influenced(dsw, dsw) ans: cited(tjg, dsw) influenced(tjg, COLL) newgoal: influenced(tjg, COLL) cited(COLL, tjg) (cited(_h1371, tjg), influenced(_h1371, COLL)) cited(_h1371, tjg) ans: influenced(dsw, dm) from table newans: influenced(dm, dm) ans: influenced(dsw, tjg) from table newans: influenced(dm, tjg) ans: influenced(dsw, dsw) from table ans: influenced(dm, dm) from table ans: influenced(dm, tjg) from table ans: influenced(dm, tjg) from table ans: influenced(dm, dm) from table ans: influenced(dm, dsw) from table done

The non-indented lines indicate additions to the table: either a new goal or a new answer to an existing goal. The posed goals are simply added to the table as is, and returned answers are prefixed by “ans:”. In lines 1–5, influenced(dm, COLL) is called, which in turn calls cited(COLL, dm) and returns the answer dsw; the new goal (on line 2) and its new answer (on line 5) are then added to the table. On line 10, evaluation of the second rule for influenced(dm, COLL) begins, with computation continuing as in Prolog (except for the table-update side effects) to

1.5 Evaluation Techniques

57

line 19. At line 19, another call to influenced(dm, COLL) is posed. A variant of this goal (in this case, identical to this very goal) is found in the table and so influenced(dm, COLL) is not actually called; instead the answers in the table for that goal are returned. Specifically, we see on line 20 that the answer of dsw is returned from the table. This step leads to a new answer from the second clause for the goal influenced(dsw, COLL), with COLL = dsw. This fact is added to the table on line 21. Then, backtracking to the first goal of that second clause returns tjg as the other answer to the subgoal cited(_h795, dsw). This step leads to the new call to influenced(tjg, COLL) in line 23, which gets added to the table, and the search through line 27 determines that tjg cited no one. At this point the contents of the table is: influenced(dm, XX): XX=dsw influenced(dsw, XX): XX=dm; XX=tjg; XX=dsw influenced(tjg, XX):

Now we can return answers from the table to suspended goals. Returning the answer dm to influenced(dsw, COLL) results in the new answer dm to the goal influenced(dm, COLL), which gets added to the table in line 29. Similarly for tjg in lines 30 and 31. Lines 32–35 show other answers being returned to calls in the table, but they generate only answers that are already in the table, so they are simply no-ops. At this point all answers have been returned to all goals, and the computation is complete. The table now contains all the answers to the posed goal (and all generated subgoals). Consider now a different, logically equivalent, way of writing the influenced rules: influenced(P1, P2) :- cited(P2, P1). influenced(P1, P2) :- influenced(P1, P), cited(P2, P).

No experienced Prolog programmer would ever try to write the transitive closure in this form, since the immediate recursion of the second clause would always result in an infinite loop. However, with tabling, such rules are easily handled and, in fact, this left-recursive form of transitive closure is more efficient! Consider the following trace of a tabled evaluation of this left-recursive form for transitive closure with the same facts as before: 1 2 3 4

influenced(dm, COLL) newgoal: influenced(dm, COLL) cited(COLL, dm) ans: cited(dsw, dm)

58

Chapter 1 Datalog: Concepts, History, and Outlook

5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

newans: influenced(dm, dsw) (influenced(dm, _h330), cited(COLL, _h330)) influenced(dm, _h330) ans: influenced(dm, dsw) from table cited(COLL, dsw) ans: cited(dm, dsw) newans: influenced(dm, dm) ans: cited(tjg, dsw) newans: influenced(dm, tjg) ans: influenced(dm, dm) from table cited(COLL, dm) ans: cited(dsw, dm) ans: influenced(dm, tjg) from table cited(COLL, tjg) ans: influenced(dm, tjg) from table done

Here, again, lines 1–6 are the same as in Prolog, with the side effects that update the goal–answer tables. At line 7 we see the recursive call to influenced(dm, COLL). This goal (up to variable renaming) is found in the table, so it is not called again but instead the answers in the table are returned to it, which leads to new answers: dsw is returned from the table in line 8, which results in a new answer of dm on line 11; and of tjg is returned on line 13. Then the new answer of dm is returned from the table but that does not lead to new answers. Finally, tjg is returned but that again does not lead to any answers. Looking at these two programs and the single-source query, it is interesting to compare the different algorithms. We see that the second derivation is shorter, and this aspect is not an accident! Say we have a database with N people in which everyone cited everyone else. Note that with the right-recursive definition we will get N tabled subgoals, each of the form influenced(person, XX) for the N persons. And each tabled subgoal will have N answers, one for each person cited. So this table has size O(N 2). For the left-recursive definition, we have only one tabled subgoal, influenced(dm, COLL)—for the initial query—and that subgoal will have only N answers. So the size of the table is O(N). It can be shown that O(N 2d) and O(N d), where d is the maximum node out-degree, are the respective worst-case time complexities of the two algorithms. Tekle and Liu [2010, 2011] present precise time and space complexity analysis for top-down evaluation with variant tabling and subsumptive tabling using finegrained complexity factors.

1.5 Evaluation Techniques

1.5.2.2

59

Other Top-Down Approaches We briefly mention some variations on the top-down strategy aimed at evaluation of Datalog (or more general) programs. Tamaki and Sato [1986] introduced OLDT resolution, which is an iterative-deepening approach with tabling of answers from previous iterations. This paper was the first formal specification of a top-down tabling algorithm for logic programming, but it did not include an implementation. The Extension Table (ET) evaluation mechanism [Dietrich 1987, Dietrich and Warren 1986] is a form of tabling that can be implemented easily in Prolog. It uses two additional structures for each predicate p to which it is applied. One we call et_ p, for the extension table for p. It has the same columns as p, and is used to record previously derived facts for p. The other is call_p, which contains the patterns of variables and constants used in subgoals for p encountered so far. These predicates record the subgoals and their answers, and are updated, or used, when subgoals are invoked and answers are returned. However, this simple Prolog-based algorithm is not complete. Consider for example the left-recursive influenced/2 definition, with the order of the two rules reversed. Then the initial goal, influenced(dm, COLL), will immediately call itself. But there are no answers for it yet, so it must fail back to try other paths to generate answers. Unless computation somehow comes back to restart that “failed” subgoal call, some answers may be lost. One might think that by re-ordering clauses, one can avoid this problem, but that strategy does not work in general. For completeness, subsequent subgoals must either suspend, to be resumed later if necessary, or must somehow be regenerated. To be complete, the ET algorithm repeatedly iterates its computation until the table predicates do not change. Variations and extensions of the ET algorithm have been developed and implemented in some Prolog systems including B-Prolog by Zhou et al. [2001] and ALS Prolog by Guo and Gupta [2001]. These techniques are known as linear tabling. The SLG algorithm of XSB [Chen and Warren 1996, Swift and Warren 2011, Warren et al. 2007] is an extension of OLDT, described above. It differs from OLDT mainly in that it suspends subgoal computations, which may later be resumed, as necessary, similarly to the discussion in Section 1.5.2.1. The fact that it suspends and resumes computation means that it does not do the recomputation that is necessary with iterative methods, such as ET, and thus has better, and simpler, complexity properties. XSB also incorporates an “incremental completion” algorithm that, by controlling the order of task scheduling and keeping track of subgoal dependencies, determines when subgoals are completely evaluated (that is, have

60

Chapter 1 Datalog: Concepts, History, and Outlook

all their answers in the table) well before the entire computation is completed. This approach enables efficient handling of default negation. Vielle developed the Query-SubQuery (QSQ) approach [Vieille 1986] specifically for Datalog, and it differs from tabling and ET in having “multigoals” where a given argument position in a predicate can contain a set of bindings. (We note that the original version of QSQ was incomplete on some programs—it did not always find all answers. Later work by Nejdl [1987] and Vieille [1987] provided complete versions of QSQ.) Such generalized goals can arise both from combining goals sharing the same pattern of variables and constants, and by solving goals against an EDB in a set-at-a-time manner. For example, consider solving the query :- related(dsw, X).

using the second rule for related/2: related(P1, P2) :- advised(B, P1), related(B, P2).

The first subgoal will be advised(B, dsw). If we ask for all matching bindings from the EDB at once, we will get back {advised(jbf, dsw), advised(wcr, dsw)}. From that result, QSQ will generate the generalized subgoal related({jbf, wcr}, X). That subgoal can be matched with the same related rule, generating the generalized subgoal advised(B1, {jsf,wcr}). This subgoal illustrates one of the advantages of QSQ, that it can merge several calls to the EDB into one call (which is useful if the EDB is managed externally, say, by a database system). Solving this generalized subgoal will give the set of bindings B1 = {hw, dss}. However, rather than creating a new generalized subgoal related({hw,dss}, P2), these bindings will be merged with the existing generalized subgoal to get :- related({jbf,wcr,hw,dss}, X).

If any of the new bindings had matched old bindings, it would indicate that there might be some results already available. QSQ comes in two flavors: iterative (QSQI) and recursive (QSQR). Both involve the steps of generating new answers and creating new subgoals [Gottlob et al. 1989]. QSQI gives priority to new answers, and suspends working on any new subgoals until all answers have been generated that do not require those subgoals. In QSQR, on the other hand, when a new subgoal is encountered, it becomes the focus and processing of the current subgoal is suspended. In practice, QSQI tends to be worse because of duplicate rule firings [Bancilhon and Ramakrishnan 1986]. QSQR seems closely related to SLG resolution applied to programs without negation, but to our knowledge no detailed comparison has been attempted.

1.5 Evaluation Techniques

1.5.3

61

Bottom-Up vs. Top-Down The question naturally arises as to whether bottom-up or top-down is generally better for Datalog evaluation, and which particular methods are best. Obviously, one can concoct programs and datasets that favor a particular method, and some methods will not work at all on a given program. (For example, some counting variants of Magic Sets work for only some patterns of recursion.) Bancilhon and Ramakrishnan [1986] compare a range of bottom-up and top-down methods analytically on a variety of Datalog programs, EDB structures and query-binding patterns. While there are different orderings of the methods on the different cases, some patterns emerge, such as Magic Sets and QSQR generally being comparable in predicted performance, and usually beating QSQI and semi-na¨ıve. Ullman [1989] claims that there are bottom-up methods that will always perform as well or better as top-down methods on Datalog. His argument is that for a given Datalog program P and query Q, there is a modified version of P that will generate all answers for Q that, when evaluated by the semi-na¨ıve method, will beat a depth-first top-down method. He further claims that the Magic Sets approach bounds the performance of memoizing versions of top-down such as OLDT and QSQ. Bry [1990], however, disputes Ullman’s conclusion. He defines a top-down method called the Backward Fixpoint Procedure (BFP) and claims that bottom-up rewrite methods and top-down memoizing methods are actually BFP in different forms. In terms of asymptotic time and space complexities, Tekle and Liu [2010, 2011] have recently established the precise relationship between (1) bottom-up evaluation using Demand Transformation and Subsumptive Demand Transformation, as well as Magic Sets, and (2) top-down evaluation using variant tabling and subsumptive tabling. However, the actual performance varies widely due to constant factors, and performance of internal data structures. Faced with the ambiguity of such analyses, some Datalog systems include both bottom-up and top-down methods, though determining the best method and optimizations to apply to a particular Datalog program and dataset is far from a solved problem.

1.5.4 Evaluation Methods Using Database Systems An obvious approach to scaling is to use a relational DBMS to manage the ground facts in a Datalog program, which are easily stored as rows in a table, where each predicate has a separate table. The DBMS might be used just for secondary-storage management and indexing capabilities, or for more general query processing and evaluation. As an example of the former, consider top-down evaluation of the query related(lvk, Y). If we are considering the rule

62

Chapter 1 Datalog: Concepts, History, and Outlook

related(P1, P2) :- advised(A, P1), advised(A, P2).

we will need to solve the sub-goal advised(A, lvk). If advised is stored in a DBMS, and indexed on the second position, the database could quickly retrieve all facts having lvk as the value in the second position. Even if we were using a very simple top-down strategy, and only considering advised(A,lvk)-facts one at a time, bringing them into memory in a batch and caching them can save I/O time. As an example of the more general strategy, consider a rule-compilation approach. When working on a query, the evaluator can accumulate EDB goals, while trying to solve IDB goals with rules. Then, when a goal list contains only EDB goals, it can be converted to a database query. For example, if for the query related(lvk, Y) we try to solve the goal using the rule related(P1, P2) :- advised(B, P1), related(B, P2).

we obtain a goal list advised(B, lvk), related(B, Y). If we then, in turn, solve the related/2 goal with the previous rule, the resulting goal list is advised(B, lvk), advised(A, B), advised(A, Y).

which consists entirely of EDB goals. This goal list can be converted to the SQL query SELECT a1.AID, a2.AID, a3.SID FROM Advised a1, Advised a2, Advised a3 WHERE a1.SID = ’lvk’ AND a1.AID = a2.SID and a2.AID = a3.AID;

If there are multiple ways to solve IDB goals, then there will be multiple database queries. (In the presence of recursion, the number of queries may be unbounded, however.) Bottom-up evaluation can also make use of DBMS capabilities. The database can handle all or part of evaluation of each application of TP for a Datalog program P. A rule that mentions only EDB predicates can be translated into an SQL query. A rule that contains one or more IDB predicates needs to also access newly derived facts during right-to-left evaluation. We can create temporary tables in the database for each IDB predicate, to hold derived IDB facts, thereby having TP execute entirely within the database. Alternatively, a bottom-up evaluator might hold IDB facts in its own memory, and evaluate each rule with a nested-loop approach, issuing database queries for EDB facts, taking into account each combination of bindings determined by the predicates outside of those queries. We discuss spe-

1.5 Evaluation Techniques

63

cific instances of these approaches in the following paragraphs—first TD methods and then BU methods. In terms of specific approaches, top-down methods tend to be more “proof” oriented, driven by derivation, whereas bottom-up methods are more “syntactic,” based on the structure of the program rules. Reiter [1977a] talks about an initial “compilation” phase where the IDB is consulted, resulting in a set of queries against the EDB. More specifically, he describes feeding the IDB to a theorem prover that flags EDB literals for later evaluation, and proposes that evaluation take place in a relational database. (Reiter notes that his particular method will not be complete for recursive queries.) The DADM [Kellogg et al. 1977] and DEDUCE 2 [Chang 1977] systems also present approaches where the IDB rules are first processed to yield (perhaps multiple) goal lists of EDB literals that can be evaluated against a relational database. Cuppens and Demolombe [1988] describe an accumulation approach to collect maximal sequences of EDB predicates for database evaluation. Grant and Minker [Grant and Minker 1981] note that the collection of EDB queries generated by such approaches will often have shared subexpressions that can be factored out for efficiency. Ceri et al. [1989] address overlap of database retrievals by caching the results returned from queries. The BERMUDA system [Ioannidis et al. 1988] takes a similar approach, collecting sequences of EDB predicates for evaluation by the database, caching results in a file, and having a “loader” component provide access to answers to a Prolog interpreter on demand. On the bottom-up side, it has long been known that the immediate consequence operator TP for a Datalog program P can be translated into a relational-algebra expression. Ceri et al. [1986] show that a complete translation of bottom-up evaluation can be obtained by extending relational algebra with a fixpoint operator. Ceri and Tanca [1987] discuss various optimizations that can be applied to such a translation. The PRISMAlog language [Houtsma and Apers 1992] was a Datalog extension that was also translated to relational algebra extended with a general fixpoint operator, then optimized into standard relational-algebra expressions with possibly transitive closure over them. Relational algebra generally needs to be translated to SQL before sending it to a relational DBMS. Draxler [1993] provides a direct translation of a logic language into SQL, bypassing relational algebra. The NED-2 Intelligent Information System [Maier et al. 2002] introduced a specialized syntax for database queries in a logic language that are transmitted to the DBMS as SQL. Finally, van Emde Boas and van Emde Boas [1986] describe an approach to extending a relational DBMS (IBM’s Business System 12) with logic programming functionality, but it is unclear whether the project got beyond the prototype stage.

64

Chapter 1 Datalog: Concepts, History, and Outlook

1.5.5 Query Optimization in Datalog Optimization of Datalog queries has been studied from a theoretical perspective as well, mostly in the 1980s. Since the optimization problem is equivalent to finding another Datalog program that computes an equivalent set of facts given a set of facts for extensional predicates, but is faster to evaluate, optimization has been mainly looked at in the context of query equivalence or, more generally, query containment. Query containment means that the results of one query are always guaranteed to be a subset of the results of another query. Showing containment in both directions proves equivalence. Many Datalog optimization techniques involve simplifying the original Datalog program by removing rules or subgoals and then testing for equivalence with the original. Unfortunately, the equivalence problem for Datalog queries has been shown to be undecidable [Shmueli 1993]. Furthermore, query equivalence turned out to be undecidable for many subclasses of Datalog, e.g., even if each rule contains at most one recursive predicate [Gaifman et al. 1993]. Thus, fragments of Datalog for which the query containment problem is decidable have been of strong interest. In particular, Cosmadakis et al. [1988] have shown that, for monadic Datalog, where only predicates with one arguments are recursively defined, query containment is decidable and that if binary relations are allowed to be recursively defined then the problem is undecidable. A stronger version of query equivalence, called uniform equivalence, is the decision problem of whether two Datalog programs compute an equivalent set of facts given initial facts for both intensional and extensional predicates. Sagiv [1987] showed that uniform equivalence is decidable, and provided an algorithm for minimizing a set of Datalog rules with respect to uniform equivalence. For an even simpler subset of Datalog, with no recursion altogether, called union of conjunctive queries, many positive results regarding query optimization have been obtained, especially in the context of SQL [Chaudhuri 1998]. More recently, Vardi [2016] showed that the query containment problem for union of conjunctive queries extended with transitive closure is also decidable. For Datalog optimization techniques that involve constraints, see Section 1.4.5. Another class of equivalence results looks at whether a (syntactically) recursive Datalog program has a non-recursive counterpart [Chaudhuri and Vardi 1992, Naughton 1986], We should add that the area of Description Logic [Baader et al. 2003] is concerned with the query containment and equivalence questions for various subsets of first-order logic, which have a limited intersection with Datalog. However, we are not aware of a result from that area that yields new insights when applied to that intersection.

1.6 Early Datalog and Deductive Database Systems

1.6

65

Early Datalog and Deductive Database Systems While much of the initial work on Datalog looked at evaluation options, semantics and extensions, a number of fairly complete implementations of deductive database systems were built in the late 1980s and early 1990s. These systems differed in several ways. One was in how they provided for ordered execution. Although Datalog has pure logical semantics and (at least, in theory) the programmer does not need to worry about rules and goal ordering, most applications need operations with side effects, such as updates and I/O, where order is important. Thus, these systems tended either to embed declarative queries in a procedural language, or have certain rules where evaluation order was enforced. A second source of variation is in what kinds of extensions of basic Datalog were supported, such as structured terms, collections, negation and aggregation. A third aspect was the optimizations and evaluation strategies employed, which are dictated to some extent by the extensions supported. The final area we call out is how base relations were managed. Some systems had disk-based storage for the EDB, either custom built or using an existing DBMS or storage system. Others worked with a main-memory database (possibly backed to secondary storage between sessions). The coverage that follows is by no means exhaustive, but serves to illustrate the differences we mentioned above. We have also leaned toward systems that were publicly distributed. We recommend the survey by Ramakrishnan and Ullman [1995] for a more in-depth comparison of systems from that era. Also, while we refer to these systems in the present tense, to our knowledge only XSB is still being supported and widely used, though some of the concepts and techniques of LDL++ are incorporated in DeALS [Shkapsky et al. 2013].

1.6.1 LDL and LDL++ The LDL language and implementation were developed at the Microelectronics and Computer Technology Corporation (MCC) during the mid-1980s as an alternative to lose coupling between logic languages and databases [Tsur and Zaniolo 1986, Zaniolo 1988]. LDL was intended to be complete enough for full application development, although it can be called from procedural languages and can call computed predicates defined in C++. While starting from pure Horn clauses with declarative semantics, it does provide some control constructs to enforce order on updates and screen output.

66

Chapter 1 Datalog: Concepts, History, and Outlook

LDL preserves complex terms from Prolog, such as date(1981, ’March’), and augments them with sets. Set values can be listed explicitly, or created in rules by grouping results. For example, a rule advisees(AID, ) :- advised(AID, PID).

will return result facts such as advisees(lvk, {jmy, fg}).

LDL provides a partition predicate that can non-deterministically split a set, to support parallel processing of large datasets. LDL also supports non-determinism through a choice predicate, which selects a specific assignment of values from a range of options. For example, in oneAdvisee(AID,PID) :- advised(AID,PID), choice((AID),(PID)).

the choice((AID), (PDI)) goal ensures that each advisor id AID will have only one person id PID associated with it in oneAdvisee. (In relational database terms, the view oneAdvisee will satisfy the functional dependency AID → PID.) LDL supports stratified negation, which is handled via an anti-join strategy. For example, the rule noAdvisees(First,Last) :- person(AID,First,Last), not advised(AID,_PID).

is evaluated by anti-join of the person table with the advised table on their first arguments. (Anti-join removes any rows from its first argument that connects with at least one row in the second argument on the join condition. Here it will knock out any person-fact that has a matching advised-fact.) LDL has two implementations. The first compiles LDL into FAD, an algebraic language designed at MCC for execution on a parallel database machine. A later, single-user prototype for Unix called SALAD [Chimenti et al. 1990] was created to make LDL available more broadly. SALAD was among the first systems to integrate many of the evaluation techniques covered in Section 1.5. SALAD applies both Magic Sets and Generalized Counting [Sacc` a and Zaniolo 1986a] rewrites. (The latter is suitable for graph-like queries where path length matters but not path membership.) SALAD evaluation is bottom-up semi-na¨ıve, but has several variants for processing joins associated with individual rules, such as various combinations of eager and lazy evaluation, sideways information passing, materialization, and memoization. Some methods support recursive predicates, and some can be used with intelligent backtracking. Intelligent backtracking detects cases where a join

1.6 Early Datalog and Deductive Database Systems

67

should not continue iterating from an earlier subgoal than it normally would. Consider the rule for the csInfo predicate from the initial example, csInfo(First, Last, Univ, Year) :thesis(PID, Univ, Title, Year, Area), person(PID, First, Last), area(Area, ’Computer Science’).

which is essentially a join of the thesis, person and area relations. Suppose we are currently considering the two rows thesis(hw, ’Harvard’, ’Economical Ontology’, 1948, lg). person(hw, ’Hao’, ’Wang’).

and fail to find any area row matching the goal area(lg, ’Computer Science’).

There is no reason to return to consider further person-rows, since the hw value comes from the thesis-row. The SALAD optimizer chooses among the joinprocessing options based on program requirements, as well as estimated costs and index availability. The LDL compiler can process a predicate differently depending on the binding pattern of a call. So, for example, for csInfo(b, b, f, f) (a call with the first two arguments bound and the last two free), it might choose to put person first in the join, whereas for csInfo(f, f, b, b), it might put person last in the join. LDL++ [Arni et al. 2003] was an extension of LDL begun at MCC and completed at UCLA. LDL++ adds support for user-defined aggregates (UDAs), offering both forms that provide final results as well as versions that can return “online” results. For example, an aggregate for average might return the intermediate value after each group of 100 inputs has been examined. LDL++ also introduced a testable form of local stratification in which iterations of a predicate are explicitly indexed, and rules with negation must only use results available at a previous index. LDL++ added access to external databases to SALAD’s internal main-memory database. LDL++ can create SQL calls to external sources that contain multiple goals, thus benefitting from the DBMS’s query optimizer.

1.6.2 CORAL
Ramakrishnan et al. [1994] developed the CORAL deductive system at the University of Wisconsin. To support imperative aspects, CORAL is “mutually embedding” with C++ code—that is, CORAL modules can be called from C++ and evaluable


predicates for CORAL can be written in C++. In terms of Datalog extensions, CORAL supports structured terms in predicates and explicit multi-set values, with grouping capabilities similar to LDL sets. CORAL also provides functions, such as union and subset, to operate over multi-set values, and aggregation functions, such as sum and count, that can reduce a multi-set to a scalar. CORAL supports a generalized form of stratified negation where a predicate can depend negatively on itself, as long as no individual ground goal does. So, for example, a predicate p can have a rule of the form

p(PID1) :- ..., advised(PID1, PID2), not(p(PID2)), ...

as long as the advised relation is acyclic, meaning a ground goal can never depend negatively on itself. CORAL programs are structured into modules, where different optimizations and evaluation techniques are applied to different modules. CORAL offers a variety of program rewriting techniques, including Magic Templates [Ramakrishnan 1991] and variants. The programmer can annotate a module to indicate which kind of rewriting to use. Other annotations on a per-predicate basis can control aspects such as duplicate elimination, aggregate selection (for example, retain only minimum-cost paths), and prioritization of newly derived facts for use in further deductions. The evaluation technique can be selected on a per-module basis, as well. A CORAL module can be evaluated bottom-up, using semi-naïve evaluation, or top-down in “pipelined” fashion, which provides a cursor-like functionality to retrieve query answers, but does not retain derived results. CORAL supports persistent EDB data via either file storage or the EXODUS storage manager [Carey et al. 1986]. Data in files is loaded all at once when first accessed, whereas EXODUS can load pages of tuples incrementally.
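To see why acyclicity makes such a rule safe, the following Python sketch (our own simplification over a tiny hypothetical advised relation, not CORAL's evaluation algorithm) decides the negated subgoal for every advisee before it is needed, by processing people in a topological order of the advised graph.

from graphlib import TopologicalSorter

# Hypothetical facts: advised(aid, pid) -- aid advised pid's dissertation.
advised = {("lvk", "jmy"), ("lvk", "fg"), ("jmy", "dsw")}

# Simplified rule (dropping the elided goals from the text above):
#   p(A) :- advised(A, B), not p(B).
# p(A) depends negatively on p(B) for each advisee B, so each B must be
# decided before A; acyclicity of advised guarantees such an order exists.
advisees = {}
for a, b in advised:
    advisees.setdefault(a, set()).add(b)
    advisees.setdefault(b, set())

p = set()
for person in TopologicalSorter(advisees).static_order():
    # every advisee of this person is already decided, so "not p(B)" is safe
    if any(b not in p for b in advisees[person]):
        p.add(person)

print(sorted(p))   # ['jmy', 'lvk'] for this data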

1.6.3 Glue-Nail
Glue and Nail are a pair of set-oriented languages developed at Stanford University to work together cleanly [Derr et al. 1994]. Glue is a procedural language, whereas Nail is a declarative logic language originally developed as part of the NAIL! system [Morris et al. 1986]. Glue provides for control structures, update operations, and I/O. Its basic computation is a join operation, which can take any combination of stored relations, Glue procedures, and Nail predicates. Glue and Nail support structured terms, including literals and terms with variables in predicate and function names, an extension of HiLog syntax referred to as NailLog. Sets are handled differently than in the previous systems described. Rather than sets being a new kind of structured term, each set value is a predicate with a


distinct name, parameterized by one or more values. So, for instance, to construct advisee sets as in an earlier example, we can use the Glue assignment statement advisees(AID)(PID) := advised(AID, PID).

One such set would be

advisees(lvk)(jmy).
advisees(lvk)(fg).

and the value advisees(lvk) can be passed around as a name for this extent. Glue also supports subgoals that compute aggregates over tuples defined by previous subgoals in an assignment statement. Both Glue and Nail are compiled into an intermediate language, IGlue, which can execute joins, tests and data movement. The Nail compiler uses two variants of Magic Sets, and decides which evaluation strategy to use at run time, which can be semi-naïve, stratified, or alternating fixpoint [Morishita 1993]. The Glue compiler analyzes the possible matches for each subgoal (which can be large because of the use of variables in predicate names), and limits the run-time tests to those predicates and procedures that could possibly match. There is also an IGlue-to-IGlue static optimizer that performs various data-flow analyses, including constant propagation. The optimizer also detects cases where a virtual relation can have at most one tuple, and provides a simpler data structure for that relation. The IGlue interpreter also has an optimizer for use at run time. The need for this component arises from the fact that during iterative computation (such as for a recursive predicate), the size of relations can change, hence the optimal execution plan for a join can change. The IGlue interpreter will therefore reoptimize a query if a relation’s cardinality changes by more than a factor of two up or down. The Glue-Nail system keeps all data in main memory during execution, but reads EDB relations from disk initially, and writes query results back to disk.
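The factor-of-two policy can be sketched as follows (a hypothetical Python illustration, not IGlue's implementation): remember the cardinalities the current plan was optimized for, and replan whenever some relation has grown or shrunk past the threshold during an iterative computation.

class JoinPlan:
    def __init__(self, relations):
        # cardinalities the current plan was optimized for
        self.planned_sizes = {name: len(rows) for name, rows in relations.items()}
        # e.g., order the join smallest-relation-first
        self.order = sorted(relations, key=lambda name: len(relations[name]))

def needs_reoptimization(plan, relations):
    for name, rows in relations.items():
        old, new = plan.planned_sizes[name], len(rows)
        if new > 2 * max(old, 1) or new < old / 2.0:
            return True
    return False

# During an iterative (e.g., recursive) computation:
relations = {"delta_path": [("a", "b")], "edge": [("a", "b"), ("b", "c")]}
plan = JoinPlan(relations)
relations["delta_path"] = [("a", "b"), ("a", "c"), ("b", "c")]   # grew by 3x
if needs_reoptimization(plan, relations):
    plan = JoinPlan(relations)   # re-derive the join order for the new sizes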

1.6.4 EKS
EKS is a Knowledge-Base Management System (KBMS) initially developed at the European Computer-Industry Research Center (ECRC) and later transferred to commercial interests [Vieille et al. 1992]. EKS is callable from MegaLog, an extended Prolog system that provides efficient disk-based storage via BANG files, a data organization similar to grid files that supports lookup on multiple attributes [Freeston 1987]. While EKS does not have complex terms, it does support negation and several other language features—the and and or connectives, aggregation, explicit


quantifiers, evaluable predicates—that are translated into an extended Datalog. It supports general integrity constraints in the form of yes-no queries on base and virtual predicates that need to be satisfied if an update is to be allowed. For example, a constraint might require that the person and area mentioned in any thesis-fact have corresponding facts in the person and area tables:

forall [PID, Univ, Title, Year, Area]
    thesis(PID, Univ, Title, Year, Area)
    -> (exists [First, Last] person(PID, First, Last))
       and (exists [Desc] area(Area, Desc)).

Constraints are supported via propagation rules and transactions. The propagation rules determine which facts actually have to be checked upon update. For example, deletion of an area-fact requires checking only for any thesis-facts that have that area. Transactions can roll back an update if any constraints are violated. The transaction capability supports other features, such as hypothetical reasoning in which “what-if” queries can be run over temporary updates. EKS rule compilation begins with determining an evaluation order on subgoals that satisfies any restrictions on binding patterns for the corresponding predicates. Such restrictions can arise from evaluable predicates, the use of negation, or virtual predicates whose own rules induce binding restrictions. The rules are then translated into programs consisting of BANG operators—such as join, select, and difference—plus control constructs. These programs implement a breadth-first, top-down search of the rules, using the Query-Subquery approach to avoid solving the same goal repeatedly. As mentioned, EKS relies on BANG files for external storage. BANG files are designed to accommodate the need to look up facts (and clause heads in general) based on alternative combinations of attributes, since a predicate can occur with varying binding patterns. EKS was transferred to the French computer company Bull, and became the basis for the VALIDITY deductive object-oriented database [Vieille 1998]. VALIDITY became the property of Next Century Media, which used it to support targeted advertising.
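Returning to the propagation rules described above, the kind of targeted checking they enable can be sketched in Python as follows (a hypothetical illustration with made-up data, not EKS's machinery): deleting an area-fact triggers a check of only the thesis-facts that mention that area, and the update is rolled back if the constraint would be violated.

# Hypothetical facts; the area names are made up for the example.
thesis = {("hw", "Harvard", "Economical Ontology", 1948, "lg")}
area = {("lg", "Logic"), ("cs", "Computer Science")}

def delete_area(area_id):
    violating = [t for t in thesis if t[4] == area_id]   # targeted check only
    if violating:
        return False   # constraint would be violated: roll back (no change)
    area.discard(next((a for a in area if a[0] == area_id), None))
    return True

print(delete_area("lg"))   # False: a thesis still references area 'lg'
print(delete_area("cs"))   # True: no thesis references 'cs', so the delete commits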

1.6.5 XSB
The XSB Logic Programming System was developed by David S. Warren and his students at Stony Brook University in the early 1990s and is being constantly enhanced and extended to this day.


It is a conservative extension of the Prolog programming language, by the addition of tables in which subgoal calls and answers are maintained to avoid redundant computation—an embodiment of the memoization strategy discussed in Section 1.5.2.1. In the context of a relational language, such as Prolog, where a single predicate (or procedure) call can have multiple (or no) answers, tabling changes the termination characteristics of the language, avoiding many infinite loops that are notorious in Prolog. With tabling, XSB is complete and terminating for Datalog programs [Chen and Warren 1996], and as such can be considered an implementation of an in-memory deductive database system [Sagonas et al. 1994]. Being an extension of Prolog, XSB can be used as a procedural programming language [Warren et al. 2007], just as Prolog. The programmer determines which predicates are to be tabled, either by explicitly declaring them to be tabled or by providing an auto_table directive telling the system to choose predicates to table. Thus, the programmer can use the Prolog language to provide the necessary procedural capabilities, such as loading files, generating output, and controlling query evaluation. XSB, as an extension of Prolog, includes function symbols and complex terms, including lists. Of course, termination is not guaranteed if the programmer uses them, but they provide programmers with the ability to program their own extensions. So programmers, for example, can use Prolog builtins, such as findall and setof, to collect lists of goal solutions and then use Prolog's list-processing capabilities to compute arbitrary functions over those lists. XSB also provides a general aggregation capability, integrated with tables, that allows programmers to define their own aggregation operators to be applied to all answers of a query. These operators can be used in recursive rules to solve such problems as finding the length of the shortest path between a pair of nodes in a directed graph. XSB does minimal automatic query optimization. The general philosophy is to allow the programmer to control how the query is evaluated by determining how it is specified. This approach was inherited from XSB's foundation in the Prolog programming language. XSB does include a few simple program transformations to improve indexing and to factor clauses to potentially reduce the polynomial complexity of some queries. But these optimizations must be explicitly invoked by the user. XSB also supports subsumptive tabling, which allows a table to be re-used if a proper instance of the original goal is encountered. This capability supports inserting a bottom-up component into the normal goal-directed computation of XSB. For example, for transitive closure, we could declare the predicate as subsumptively tabled, and then make the fully open call (i.e., where each argument is a distinct variable). In this case, there will be only one call that uses the clauses that match


this predicate; every subsequent call is subsumed by that first call, and so it will suspend on the table associated with the initial open call, waiting for answers to show up that it can return. So the first answers must come from the base case, since only these answers do not require a recursive call. The next answers will come from a single use of the recursive call, and so on, simulating bottom-up computation. XSB does not directly provide much support for persistent data. Using Prolog, programmers can read and write files to define internal predicates. So the programmer can batch-load data files, or can read them incrementally. Some work explored evaluation-control strategies that would provide more efficient disk access patterns when accessing external data [Freire et al. 1997]. XSB also has a powerful ODBC interface to any relational database system,20 so an XSB user has access to data persistence facilities.
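The answer order just described, base-case answers first and then answers needing one more recursive step in each round, can be sketched in Python as follows (our own illustration of the effect over hypothetical adv facts, not XSB's SLG engine).

adv = {("a", "b"), ("b", "c"), ("c", "d")}   # hypothetical advisor edges

table = set(adv)      # round 0: answers from the base case anc(X,Y) :- adv(X,Y).
frontier = set(adv)
while frontier:       # round k: answers that need k uses of the recursive rule
    new = {(x, y) for (x, z) in adv for (z2, y) in frontier if z == z2} - table
    table |= new
    frontier = new

print(sorted(table))  # all anc answers, produced round by round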

1.6.6 Other Systems
Aditi was a deductive database project started in 1988 at the University of Melbourne [Vaghani et al. 1994]. Aditi-Prolog is a declarative variant of Prolog, which can be used to define predicates in the NU-Prolog system [Ramamohanarao et al. 1988]. Aditi can evaluate predicates bottom up (with optional parallelism) and in tuple-at-a-time top-down mode. Aditi provides different strategies to order literals in a clause during compilation, such as choosing the next subgoal to maximize bound variables or to minimize free variables. Aditi programs are translated to relational algebra programs, which are further optimized using programming-language and database techniques. Joins can be evaluated by several methods, which can take advantage of different tuple representations that Aditi supports. Parallelism is available both among different subgoals of a rule and among multiple rules for a predicate. DECLARE is the language for the Smart Data System (SDS) [Kießling et al. 1994] whose foundational work was done at the Technical University of Munich (TUM) and then continued at a research and development center during the period of 1986–1990. One of the main targets of SDS was decision making over heterogeneous data, hence it emphasizes connectivity to multiple data sources, both local and remote. DECLARE is a typed language using SQL types plus lists and supporting constraints. SDS provides distributed query processing, where evaluation plans can contain complex SQL queries executed by external systems. Also, the SDS developers were interested in search strategies beyond top-down and bottom-up techniques, in particular, using domain-specific knowledge. For example, in a rule

20. http://docs.microsoft.com/en-us/sql/odbc/microsoft-open-database-connectivity-odbc


for connecting flights, the intermediate city will generally be in the same geographic region as the source and destination cities. LOLA is another deductive database system from TUM [Freitag et al. 1992], which can access stored relations in external databases and predicates defined in Common Lisp. It supports type groups that collect terms into classes, such as airports and geographic regions. LOLA is translated to an intermediate language, from which it builds operator graphs that are optimized for subexpressions and indexing opportunities. These graphs are evaluated in R-Lisp, which implements an in-memory algebra for relations with complex objects. Some of the preprocessing routines are in fact written in LOLA itself. RDL is a production-rule system from INRIA [Kiernan et al. 1990]. It uses “if . . . then” rules with the consequent being a sequence of inserts, deletes, and updates of stored or virtual tables. We mention it here because unlike most production-rule systems, which fire a rule separately for each instantiation, RDL takes a set-oriented approach similar to most Datalog systems. It computes all bindings of a rule body under the current database state, then processes all modifications. The Starburst project at IBM sought to make relational DBMSs more extensible in several dimensions, including query processing [Haas et al. 1989]. It introduced table functions. A table function produces a table as a result, and can call itself in its own definition, thus expressing a recursive query. An interesting outcome of the project was a method for applying the Magic Sets technique to full SQL [Mumick and Pirahesh 1994]. FLORID [Frohn et al. 1998] is an object-oriented Datalog system developed in the 1990s at Freiburg University. It differs from all the aforementioned early systems in that it was based on the object-oriented model provided by F-logic (see Section 1.4.6) and not on the usual predicate syntax. It was thus one of the first systems that provided theoretically clean support for semi-structured data.21 It was also one of the first to support the well-founded semantics for negation (Section 1.4.1). In other aspects, FLORID’s evaluation strategy is bottom-up and it has a very flexible support for aggregation, which allows aggregations to be nested arbitrarily.

1.7 The Decline and Resurgence of Datalog
After its swift rise in the 1980s, interest in Datalog entered a period of decline in the 1990s and then started to rise again in the 2000s. This section reviews some of the reasons for the decline and the subsequent resurgence.
21. Another was WebLog [Lakshmanan et al. 1996], which deals with semi-structured data using a language very similar to F-logic.


1.7.1 The Decline
With the notable exception of XSB, the systems in the previous section are no longer supported. A limited form of linear recursion made it from Starburst into IBM’s DB2 product and into the SQL:1999 standard. Datalog and deductive databases did not emerge as a major commercial category. Research on Datalog ebbed in the 1990s. The reasons for the waning of interest are not completely clear, but we consider here some contributing factors. We were helped by comments from colleagues in the field (see Acknowledgments). No “Killer Application”. A new technology is often propelled by a “killer application”

that it enables, for which existing technologies are ill-suited. For example, the World Wide Web led to major growth in the Internet (and network connectivity in the home). No such application emerged for Datalog—nobody seemed that interested in computing his or her ancestors. While there were problems that could not be handled directly by conventional relational DBMSs, there were often specific products in the market to solve those problems. For example, IBM’s COPICS product had a module that handled the Bill-of-Materials problem (traversing down the part-subpart hierarchy while aggregating numbers of parts, weight, total cost, and so forth), thus supporting a particular type of recursive query. Theoretical and Practical Limitations. There were limitations in both evaluation

technology and computer hardware that hampered the efficiency and applicability of early Datalog systems. While Magic Sets and similar approaches helped the performance of bottom-up evaluation, in the worst case they could cause an exponential expansion in program size, if the target predicates had many binding patterns. For top-down evaluation, while tabling techniques had been investigated, there were still issues with handling negation, incrementality, and the best way to schedule subgoals. On the hardware side, computers were limited in processor power and memory (4–64 MB of main memory was typical for desktop machines in the 1990s), which meant that Datalog programs on what today would be considered small amounts of data could easily be disk bound. Loss of Mindshare by Logic Programming in General. Logic programming was being

overshadowed by object-oriented languages, which were much better at user interfaces and graphics at a point where powerful desktop workstations were coming on the market. Prolog had modest UI capabilities, and most would not transfer into Datalog, so you could not write a complete application in it, in contrast to object-oriented languages such as Smalltalk and C++. Programmer productivity issues seemed focused on graphics and interactivity vs. sophisticated query and analysis,


though there was still interest in expert systems and knowledge bases. Coverage of Datalog (and logic programming) in academic settings declined, meaning there were fewer programmers versed in logic languages as compared to object-oriented languages. Antipathy in the Database Systems Community. There was open antagonism by

some in the commercial and systems research communities toward deductive database work. Some of that ill will seemed rooted in a vested stake in SQL orthodoxy (which also manifested in criticism of object-oriented databases). That sentiment might also have come from feelings that deductive database work was not addressing the most important issues in data management, such as extensibility and spatial access methods. The nature of some of the research contributions might have added to these perceptions. For example, there were many papers on proposed evaluation techniques, but a lot of them were not actually implemented, much less benchmarked for performance. Barriers to Entry in the Database Market. While Datalog systems could support more

expressive querying than SQL-based DBMSs, the latter had many features that were often primitive or lacking in Datalog implementations, such as concurrency control, recovery, authorizations, cost-based optimization, indexing (and other physical design options), replication and back-up services, integrity constraints, and programming interfaces from multiple languages. While some Datalog implementations had some of these features, most database applications need all of them. Faced with a choice between recursive queries and the other features, developers chose the other features. Recursive queries could be implemented in a general-purpose programming language making calls to the DBMS, whereas simulating the missing features in Datalog systems that did not have them was in most cases infeasible. Some of the features, such as concurrency and recovery, need quite sophisticated techniques to perform efficiently, and the investment costs to provide them to a Datalog system were likely beyond the means of most development projects. An alternative approach is to have a regular DBMS as a submodule of a Datalog system (and some systems followed this approach). Such a hybrid could provide most of the features mentioned; however, query optimization and evaluation were split across two components, limiting the performance of the resulting system. Datalog systems could not meet the minimum requirements to compete in the data-management market. Inaccessibility of the Literature. Some whom we consulted said they found the

deductive database literature overly dry and complicated. Papers were not easy to


understand, and often lacked compelling examples or applications. This aspect may have limited uptake of ideas into existing DBMSs, especially since papers did not say how to fit these approaches into current implementation technology. Also, while Datalog bore close resemblance to the domain relational calculus, leading relational languages, such as SQL and QUEL, were tuple-calculus based. While the difference may seem trivial, and although in many cases Datalog allowed more concise expression of queries, positional notation can be challenging and error-prone for predicates with dozens of attributes, compared to named notation. Contributions of Datalog. Even during its decline stage and before it started to surge

again, Datalog had lasting effects on the contributing areas of logic programming, knowledge representation, and databases. In logic programming, it provided a “declarative reset,” and got people thinking again about how far one could go in logic languages without procedural or extra-logical features. For example, Datalog provided an arena to examine the recursion and negation, to find approaches that were more declarative than negation-as-failure. One approach was the Stable Model semantics, which led to Answer Set programming—the focus of much of current logic programming activity. In knowledge representation, it has influenced languages to deal with graph-based knowledge structures. For example, the core of the SPARQL language for RDF builds from Datalog-like goals over triples of the form (Subject, Predicate, Object). (SELECT queries in SPARQL have been shown equivalent to non-recursive safe Datalog with negation [Angles and Gutierrez 2008] and some implementation strategies for SPARQL are based on translation to Datalog [Polleres 2007].) In the database area, many SQL products now have support for certain forms of recursion, and some evaluation techniques for Datalog have been applied to non-recursive aspects of relational query languages. Another large area of contribution for Datalog was as a vehicle for understanding the complexity and expressiveness of database querying. Complexity refers to the amount of time or space to evaluate a query over a database, as a function of input size. The relevant input can be the data (for data complexity), the query (for query or expression complexity), or both (for combined complexity). The complexity is generally characterized by broad classes, such as logarithmic space (LOGSPACE) or polynomial time (PTIME). A classical result is that domain relational calculus (essentially non-recursive Datalog with negation) has LOGSPACE data complexity and PSPACE query complexity. (Domain relational calculus is PSPACE-complete, meaning—informally—that domain relational calculus evaluation is as hard as any problem in PSPACE.) Abiteboul et al. [1995] provide an introduction to such complexity concepts and results. We note that Datalog itself is PTIME-complete


for data complexity and EXPTIME-complete for query and combined complexity [Immerman 1982, Vardi 1982]. Datalog has helped the study of database query complexity (and decidability), particularly in terms of the complexity “consequences” of different language features and semantics. Compared to a syntactically complicated language such as SQL, language features can be more cleanly removed from or added to Datalog, such as recursion, negation, second-order variables, inequality literals, ordered domains, fixpoint operators, iteration, disjunction (in rule heads) and complex objects. It also readily admits alternative semantics, such as Well-Founded vs. Stable-Model semantics for negation. A survey by Dantsin et al. [2001] covers complexity of these variants and more. There is also work investigating restrictions of Datalog, such as linear recursion (at most one IDB goal per rule body) [Ullman and Van Gelder 1988] and limitation to a single rule [Gottlob and Papadimitriou 1999]. We also note that Datalog has proved a useful tool in studying the complexity of non-database problems, such as constraint satisfaction [Feder and Vardi 1999]. Work on expressiveness considers questions such as which queries cannot be expressed in a particular language, how two query languages are related in terms of the queries they can express, and how query languages relate to complexity classes. For example, it is not possible to express a transitive-closure query in non-recursive Datalog [Aho and Ullman 1979], linear Datalog is strictly less expressive than regular Datalog [Afrati and Cosmadakis 1989], Datalog with equality and inequality predicates is equivalent to fixpoints over positive existential queries [Chandra and Harel 1985], and Datalog with ordered structures (that is, with a built-in ordering predicate on the domain of values) exactly captures polynomial time [Papadimitriou 1985]. Such expressiveness results are not purely of theoretical interest—they can show that a language with better “programmability” can be substituted for another without diminishing the queries that can be expressed. For example, both Monadic Datalog (all head predicates have a single variable) and a natural subset of Elog (a language for web wrappers) are as expressive as Monadic Second-Order logic (logic with variables ranging over sets) on trees [Gottlob and Koch 2004]. Thus, for tasks such as information extraction from the Web, the former languages provide an easier-to-use alternative to the latter. The survey by Dantsin et al. [2001] also covers a wide range of expressiveness results.

1.7.2 The Resurgence
Work on Datalog always kept going at some level, even after the initial flurry of activity subsided. Beginning around 1998 there was renewed interest in Datalog, driven


by its use to solve real problems rather than by intrinsic properties or theoretical developments (although there have been new extensions and evaluation techniques motivated by these applications). Some of this interest was due to the new (at the time) field of the Semantic Web and its focus on inference [Berners-Lee et al. 2001, Boley and Kifer 2013a, Decker et al. 1998, Guha et al. 1998]. In other cases, Datalog was being applied in settings where database technology was not commonly used, such as program analysis and networking. Thus, its adoption was more a matter of introducing a declarative approach into a new domain or a domain where solutions were typically expressed imperatively, rather than of replacing another database language. Semantic Web. The idea of the Semantic Web emerged in late 1990s as the next step in the evolution of the World Wide Web. Instead of HTML pages, the Semantic Web was to rely on machine-readable logical statements (most of which would be just database facts written in an open standard format called RDF [Lassila and Swick 1999]), making search results more relevant, giving rise to better recommendation systems, making Semantic Web services possible (for example, automatic arrangement of trips), and even enabling automated contract negotiation. The two main research directions in this area are reasoning and machine learning; we focus here on the former. In reasoning, two approaches dominated: OWL [Grau et al. 2008], which is based on a particular strain of knowledge representation, called Description Logic [Baader et al. 2003], and rule-based reasoning, which is based on logic programming and includes Datalog. Both approaches have led to W3C recommendations (W3C calls its standards “recommendations”)—the already mentioned OWL and the Rule Interchange Format (RIF) [Boley and Kifer 2013a, Boley and Kifer 2013b], which is based on logic programming and, more specifically, on F-logic and HiLog (see Section 1.4.6). The expressive powers of OWL and of rules are quite different, which is why both approaches exist, each having its adherents. This bifurcation of efforts on the reasoning side has also led to a number of efforts to reconcile OWL with rules. First, it resulted in the development of an OWL subset, called OWL RL,22 which lies in the intersection of OWL and Datalog. Second, several approaches attempted to combine OWL and logic programming at a much more general level so as to take advantage of both paradigms [Eiter et al. 2004a, Eiter et al. 2004b, Motik et al. 2006]. Overall, building the Semantic Web turned out to be harder than originally envisioned, but the effort is proceeding apace. It resulted in Google’s Knowledge 22. http://www.w3.org/TR/2008/WD-owl2-profiles-20081008/#OWL_2_RL


Graph, DBpedia, Wikidata, Semantic MediaWiki, and other resources. In addition, it led to a number of commercial Datalog-based systems for RDF stores such as Ontotext’s GraphDB,23 Franz’s AllegroGraph,24 and Apache’s Jena,25 to name a few. Many leading logic programming systems (for example, SWI Prolog, Ciao Prolog, and XSB) also have extensive web-related capabilities.

23. http://ontotext.com/products/graphdb/
24. http://franz.com/agraph/allegrograph/
25. http://jena.apache.org/documentation/inference/

Datalog for Program Analysis. Program analysis is an example of an application of Datalog to an area that had largely used imperative approaches before. Figuring out program properties, such as call chains or pointer aliasing, often involves graph traversal or mutual recursion, which are directly expressed in Datalog. Lam et al. [2005] relate their experience with implementing program analyses to detect security holes in web applications. They turned to Datalog after coding their analysis algorithms in Java proved hard to get correct and efficient. They also found their Datalog implementation easier to maintain and extend. Their evaluation approach to Datalog was to translate it to relational algebra and thence to Boolean functions. Relations with program-structure and execution-context information are represented as characteristic functions (mappings of tuples to true or false). Each function is encoded into a compact structure, called a binary decision diagram, upon which the Boolean functions can be efficiently evaluated. The CodeQuest system [Hajiyev et al. 2006] used Datalog to query program information, aimed in particular at enforcing style rules and programming conventions that the compiler does not enforce, such as requiring all subclasses of an abstract class to reimplement a particular method. The developers report that Datalog hits a “sweet spot” between expressiveness and efficiency. While not working with enormous datasets, the space requirements can exceed main memory, with code bases containing more than one million lines of code and greater than 10,000 classes. Their approach to evaluation is to send Datalog through an optimizing compiler that emits SQL, using iteration in stored procedures or SQL’s Common Table Expressions to handle recursion. The Semmle product takes a similar approach for code inspection and analysis. However, queries are written in an SQL-like language that is translated to Datalog and thence to relational database queries [de Moor et al. 2007]. The Doop system [Smaragdakis and Bravenboer 2011] also queries programs using Datalog, such as performing points-to analysis—determining all the memory locations to which a pointer variable might refer. They report that their analyses are as precise as previous methods, but faster.
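To make the flavor of such analyses concrete, the following small Python sketch (our own illustration with made-up facts, not the Datalog or engines used by the systems above) computes an Andersen-style points-to relation by a bottom-up fixpoint; the two rules it encodes correspond to pointsTo(V,H) :- new(V,H) and pointsTo(V,H) :- assign(V,W), pointsTo(W,H).

# Hypothetical program facts: new(v, h) means variable v allocates object h;
# assign(v, w) means the value of w flows into v.
new = {("x", "h1"), ("y", "h2")}
assign = {("z", "x"), ("w", "z")}

# Rule 1: pointsTo(V, H) :- new(V, H).
points_to = set(new)

# Rule 2: pointsTo(V, H) :- assign(V, W), pointsTo(W, H).
# Evaluated bottom-up to a fixpoint (naive iteration, for clarity).
changed = True
while changed:
    changed = False
    for (v, w) in assign:
        for (w2, h) in list(points_to):
            if w == w2 and (v, h) not in points_to:
                points_to.add((v, h))
                changed = True

print(sorted(points_to))
# [('w', 'h1'), ('x', 'h1'), ('y', 'h2'), ('z', 'h1')]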


Doop’s evaluation relies on the aggressive optimization of Datalog, as well as programmer-assisted indexing of EDB relations. The use of Datalog for program analysis illustrates several themes that emerge across recent applications that motivate some of the renewed interest in the language. One is handling of complex reasoning. For example, program analysis involves mutual recursion among different properties being reported, such as points-to, call graph and exception analysis. Also, the algorithms are expressed in a way that they can often be readily modified by adding an additional parameter to a predicate, or a few more rules. For example, Lam et al. [2005] point out that the Datalog rules for their context-sensitive points-to analysis are almost identical to their counterparts in context-insensitive points-to analysis. In contrast, capturing the reasoning and relationships for program analysis, along with appropriate control strategies, in an imperative programming language is a labor-intensive and error-prone task. Also, while the data sizes involved are not staggering, they are large enough that manual optimization of code may not be sufficient to deal with them efficiently. Another use of Datalog illustrated by the program-analysis domain is the use of Datalog as an intermediate form. It is often fairly easy to write a translator from a domain-specific syntax to Datalog. From there, a wide variety of implementation and optimization options are available. As we have seen, there are multiple control strategies based on top-down and bottom-up methods, and representation alternatives for data and rules, such as binary decision diagrams. Although not discussed extensively in this chapter, Datalog admits a range of parallel and distributed evaluation techniques. With these approaches, Datalog performance can match or exceed that of custom analysis tools [de Moor et al. 2007, Smaragdakis and Bravenboer 2011]. Declarative Networking. Datalog has also gained popularity for reasoning about

networks. The general area of declarative networking [Loo et al. 2009] has grown to encompass network protocols, services, analysis, and monitoring. It is an area with many examples of recursive reasoning over both extensional (link tables) and intensional (state-transition rules) information. Describing network algorithms in Datalog exposes the essential differences between them, generates predicates that can be reused across algorithms, and suggests new alternatives [Loo et al. 2006]. Simple syntactic extensions of Datalog, such as location variables, can capture key aspects of the domain. A range of distributed evaluation techniques are available, including asynchronous data-flow computation derived from the convergence properties of Datalog. (Bottom-up evaluation of a monotonic Datalog program will always converge to the same result regardless of the order in which rules are applied.)
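That order-independence can be checked with a small Python experiment (a toy sketch over hypothetical link facts, not a networking system): evaluating reachable(X,Y) :- link(X,Y) and reachable(X,Y) :- link(X,Z), reachable(Z,Y) to a fixpoint gives the same relation no matter which rule is applied first in each round.

link = {("a", "b"), ("b", "c"), ("c", "d")}   # hypothetical link facts

def fixpoint(rule_order):
    reachable = set()
    changed = True
    while changed:
        before = set(reachable)
        for rule in rule_order:
            if rule == "base":
                reachable |= link
            else:   # recursive rule
                reachable |= {(x, y) for (x, z) in link
                              for (z2, y) in reachable if z == z2}
        changed = reachable != before
    return reachable

# Both orders of rule application converge to the same fixpoint.
assert fixpoint(["base", "recursive"]) == fixpoint(["recursive", "base"])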


The P2 facility [Loo et al. 2005] supports the expression of overlay networks, in a (cyclic) data-flow framework using asynchronous messages. The authors found that using Datalog as a basis supported succinct expression of overlay algorithms. For example, the Chord overlay algorithm can be expressed in 47 rules, with many reusable parts, in contrast to finite-automaton-based approaches that are less concise and more monolithic, and the P2 version provides acceptable performance. The Datalog-based approach called out similarities and distinctions among different algorithms. (For example, a wired and a wireless protocol were found to differ only in predicate order in one rule body.) Their approach also made hybrid algorithms easy to construct. Overlay specifications in P2 are written in Overlog, a Datalog-based language with extensions for location attributes in tuples and continuous queries over a changing EDB (including handling of deletion). Overlog is translatable into data-flow graphs with local Datalog evaluation engines communicating via asynchronous messages. Loo et al. [2006] extend the P2 work, looking at general techniques for prototyping, deploying and evolving continuous recursive queries over networks. In that work, the network is both the object of study and the vehicle for evaluation. Their language NDlog (for Network Datalog) emphasizes this latter aspect through both location flags in predicates and an explicit #link EDB predicate that lists all direct node-to-node connections in the network. Every predicate in NDlog must have a location-specifier variable as the first argument, which is used to distribute tuples in the network. Further, any “non-local” rule must contain a #link literal, and every predicate in the rule must have the source or destination of that link as its location specifier. An example from the paper is

path(@S, @D, @Z, P, C) :-
    #link(@S, @Z, C1), path(@Z, @D, @Z2, P2, C2),
    C = C1 + C2, P = f_concatPath(link(@S, @Z, C1), P2).

which says there is a path from source @S to destination @D with first hop to @Z. (NDlog supports list values through built-in functions.) We see that all predicates in the rule have either @S or @Z as the location specifier. NDlog is compiled into local dataflow graphs that run at each node in the network. A non-local rule such as the one above is split into two local rules communicating through messages across the link. Evaluation is a relaxed form of semi-naïve, in which iterations happen locally, with tuples that arrive from other nodes during an iteration either being used immediately or buffered for the next iteration. The system uses both traditional and newly proposed Datalog optimizations. The former include predicate reordering, Magic Sets and aggregate selection, in which presence of an aggregate function, such as min in shortest path, allows suppression of non-optimal results [Furfaro


et al. 2002]. The authors also present optimizations particular to the distributed setting, based on caching of results that pass through a node, and merging of messages with certain attribute values in common, as well as choosing between (or possibly combining) alternative evaluation strategies based on costs. Abiteboul et al. [2005] apply Datalog to determining possible event sequences leading to alarm reports in a networked application. (A distributed telecom system is their example.) They target Datalog because of the need to reason over extensional (alarms) and intensional (potential execution flows) knowledge with recursive dependencies. The authors note that Datalog’s convergence properties are a good match to asynchronous, distributed settings, such as reasoning in and about networks. Their distributed version of Datalog—dDatalog—assigns whole relations to nodes. They adapt QSQ to evaluating dDatalog in a distributed setting. When a node encounters a remote predicate in evaluating a query, it ships that sub-goal, its binding pattern, and the remainder of the query to the node with the remote predicate. The Cassandra access-control framework [Becker and Sewell 2004] is another example of the use of Datalog in a distributed setting. Cassandra augments Datalog with location and credential-issuer qualifiers, as well as constraints. The domain for the constraints is pluggable, and gives rise to “a sequence of increasingly expressive policy specification languages.” For example, a constraint domain for arithmetic allows specification of policies that limit the length of a delegation chain. Policy rules can invoke remote policies over the network, to request resources, roles, and credentials, and are interpreted in a policy-evaluation engine. Cassandra chooses a tabled, top-down evaluation approach, because of potential problems with bottom-up evaluation. In particular, the distributed setting means that right-to-left evaluation of rules would require distributed query plans. Datalog and Big Data. More recently, Datalog has been the basis for scalable data-

analytics systems, especially those targeting graph structures such as arise in social networks. SociaLite [Lam et al. 2013] extends Datalog for analyzing social networks. It supports some kinds of recursive aggregates, and allows the programmer to provide hints on data representation and evaluation order, which leads to performance comparable to imperative implementations, but with programs that are an order of magnitude smaller. Distributed SociaLite [Seo et al. 2013] further allows the programmer to specify how data should be shared across servers, and derives how to distribute evaluation and organize communication from that specification. The Myria system supports iterative analytic routines using recursion in Datalog [Wang et al. 2015]. Myria supports asynchronous and prioritized evaluation of Datalog on a


shared-nothing cluster, as well as fault tolerance. BigDatalog [Shkapsky et al. 2016] also supports recursive analyses, in this case for Spark. While Spark supports iterative processing, the authors point out that directly coding recursive computations via iteration is challenging in Spark, and tends to be inefficient, as it forgoes the opportunities for global analysis and optimization that Datalog provides. BigDatalog is especially well suited to graph analytics, where it outperforms systems specifically targeted at graphs, such as GraphX [Gonzalez et al. 2014]. One advantage systems such as SociaLite, Myria, and BigDatalog have over earlier Datalog systems for analytics applications is support for aggregates—for example, max and sum—in recursive rules [Zaniolo et al. 2017]. Datalog is also the basis for parallel approaches to data analysis. Google’s Yedalog [Chin et al. 2015] is a recent Datalog-based system that is designed specifically to exploit data parallelism and aggregation of massive amounts of data. Also, the DeALS system [Shkapsky et al. 2013] has compilation techniques to support multicore processing [Yang et al. 2017]. Other Application Areas. We briefly mention other areas in which Datalog—and its

derivatives—have been applied. The DLV system [Leone et al. 2006] uses Disjunctive Datalog to represent knowledge and to reason about alternatives. Disjunctive Datalog allows the “or” of predicates in the head of a rule, which gives rise to multiple minimal answer sets, similarly to what is seen for negation in a rule body. DLV also includes integrity constraints that rule out certain answer sets, and “soft” constraints that prioritize among answer sets. Ashley-Rollman et al. [2007] use a Datalog-like language called MELD to plan shape-shifting maneuvers for a swarm of modular robots. They find that the declarative approach allows programs that are 20 times smaller and allow more optimizations, compared to imperative alternatives. A variety of Web-related applications also use Datalog. The Lixto system [Gottlob et al. 2004] supports Web data extraction and service definition. With Lixto, a designer can specify a data-extraction wrapper via a special browser and the system generates a Datalog program to perform the extraction. Its Elog language is based on Monadic Datalog (where all IDB predicates are unary), and supports visual wrapper specification. Lixto technology is now part of the McKinsey Periscope product. A follow-on project, DIADEM [Furche et al. 2014], targets fully automated data extraction from collections of web pages. DIADEM uses a form of Datalog± internally to represent domain knowledge. Orchestra [Ives et al. 2008] supports data sharing on the Web, and translates data-exchange programs to an extended version of Datalog with Skolem functions to support virtual-entity creation. WebDamlog [Abiteboul


et al. 2011] is a Datalog-style language for implementing distributed applications among peer systems on the Web. Toward that goal, WebDamlog supports the exchange of both facts and rules. We will also see uses of Datalog for data management in LogicBlox and Datomic in the next section. Also see chapters 7 and 9 to learn about applications in bioinformatics and natural language processing. Summary. Datalog is gaining more acceptance of late, as evaluation methods and

optimization techniques proposed on a theoretical level in the past have been implemented and tested on actual applications. These applications have demonstrated the utility of the language as a direct interface, as an intermediate language, and as a framework for additional features. These applications have also inspired new extensions, such as for distribution, along with new evaluation and optimization approaches. There is also a growing number of Datalog vendors and companies that rely on Datalog for their mission-critical applications, including LogicBlox, Datomic, XSB, Inc., Semmle, DLV Systems, Ontotext, and Coherent Knowledge Systems.

1.8 Current Systems and Comparison
This section describes and compares four current systems that implement Datalog: XSB, LogicBlox, Datomic, and Flora-2/Ergo. We show how a selection of Datalog rules and queries are written for each system, and report on experiments that illustrate their performance. These systems all implement Datalog (or a superset). However, the observed performance of each system is unique. They show asymptotically different behavior due to fundamental factors, including their choices for evaluation strategies, selection of underlying data structures, and data indexing. Even when these factors are the same, there are differences in performance due to implementation details, such as the choice of implementation language.

1.8.1 Preliminaries: Data and Queries
For the illustration of rules and queries, we use the Mathematics Genealogy data [MathGen 2000] that contains basic facts about people, dissertations, and advisor relationships between people and dissertations. We assume the following EDB predicates, which are slightly different from those in the introductory example. In particular, an advisor ID (AID) is connected to a dissertation ID (DID), and a dissertation has a candidate (CID) who wrote it.


person(PID, Name)
dissertation(DID, CID, Title, Degree, Univ, Year, Area)
advised(AID, DID)

The data set contains 198,962 people, 202,505 dissertations, and 211,107 facts of the advised predicate. In each of the systems, we answer the following five queries to illustrate the performance.
1. Who are the grand-advisors of David Scott Warren?
2. Which candidates got their degrees from the same university as their advisor?
3. Which candidates worked in a different area than at least one of their advisors?
4. Who are the academic ancestors of David Scott Warren?
5. How many academic ancestors does David Scott Warren have?
These queries make use of simple joins, recursive rules, and aggregations that facilitate assessing the performance of the systems in various aspects. In the next subsections, we introduce four systems and how these queries are expressed in each.

1.8.2 XSB
XSB [Sagonas et al. 1994] is a top-down implementation of Prolog extended with tabled resolution as described in Section 1.5.2.1. XSB allows fine-grained control over which tabling strategy and indexes to use for each predicate, and the choices for tabling and indexing affect asymptotic behavior. Consider Query 1, which is written in XSB as follows:

author(D, C) :- dissertation(D, C, _, _, _, _, _).
adv(X, Y) :- advised(X, DID), author(DID, Y).
% Grand-advisors of DSW
q1(N) :- person(D, 'David Scott Warren'), adv(X, D), adv(Y, X), person(Y, N).
% Enumerate all results to Query 1
enumallq1 :- q1(_), fail; true.
?- enumallq1.

Here the rule defining enumallq1/0 uses a Prolog idiom of a fail-loop, which has the effect of generating all results in the most efficient way.


By default, XSB indexes each predicate on its first argument. However, that may not be the most effective choice. For example, in the rule defining q1, after the first subgoal binds D, top-down evaluation will need to obtain values for the first argument of adv given a binding for the second argument. The following directive creates an index on the second argument: :- index adv/2-2.

To illustrate tabling, consider Query 4:

anc(X, Y) :- adv(X, Y).
anc(X, Y) :- adv(X, Z), anc(Z, Y).
% Ancestors of DSW
q4(N) :- person(X, 'David Scott Warren'), anc(Y, X), person(Y, N).
% Enumerate all results to Query 4
enumallq4 :- q4(_), fail; true.
?- enumallq4.

For this query, one can show that there can be repeated subgoals for predicate anc. The following tabling directive will avoid reevaluating such repeated subgoals and, therefore, avoid an infinite loop, which would affect an ordinary Prolog system: :- table anc/2.

Rules for Queries 2 and 3 can be written in XSB as follows:

area(P, R) :- dissertation(_, P, _, _, _, _, R), R \= ''.
univ(C, U) :- dissertation(_, C, _, _, U, _, _).
% Same university as advisor
q2(N) :- adv(X, Y), univ(X, U), univ(Y, U), person(Y, N).
% Different area than advisor
q3(N) :- adv(X, Y), area(X, R1), area(Y, R2), R1 \= R2, person(Y, N).
% Enumerate all answers to Queries 2 and 3
enumallq2 :- q2(_), fail; true.
enumallq3 :- q3(_), fail; true.

Query 5 can be implemented using the aggregate construct findall in XSB, which returns all answers to a query as a list via the last argument:

% Count the ancestors of DSW
q5(Count) :- findall(A, q4(A), S), length(S, Count).


The same query can also be implemented using the table aggregation construct in XSB as follows:

% Count the ancestors of DSW
:- table q5_2(fold(cnt/3, 0)).
q5_2(X) :- q4(X).
cnt(C, _, C1) :- C1 is C + 1.
?- q5_2(Count).

The special tabling directive in the first line works as follows: whenever a new answer for q4 (and hence for q5_2) is generated, XSB calls cnt(Acc, _, NewAcc) with the current accumulator value Acc taken from the table, removes Acc from the table, and adds NewAcc. For the first answer found, the first argument in the call to cnt is 0, as indicated by the tabling directive. So q5_2 returns the number of answers found for q4.

1.8.3 LogicBlox
LogicBlox [Aref et al. 2015] is a commercial system unifying the programming model for enterprise software development that combines transactions with analytics by using a flavor of Datalog called LogiQL. LogiQL is a strongly typed, extended form of Datalog that allows coding of entire enterprise applications, including business logic, workflows, user interfaces, statistical modeling, and optimization tasks. LogicBlox evaluates LogiQL rules in a bottom-up fashion. In terms of the language, in addition to pure Datalog, LogiQL has functional predicates that map keys to values, and various aggregation operators. In LogiQL, the arguments of each EDB predicate need to be typed. For our queries, these types can be specified as follows:

person[pid] = name -> int(pid), string(name).
advised(aid,did) -> int(aid), int(did).
dissertation(did,cid,title,degree,university,year,area) ->
    int(did), int(cid), string(title), string(degree),
    string(university), string(year), string(area).

In the specification above, person is a functional predicate (shown by the bracket notation), mapping person IDs to names. Using these specifications, Query 1 can be answered with the following rules: author[did]=cid, univ[cid]=university, area[cid]=area Person|].

These declarations specify the types of the attributes for classes Dissertation and Person (=> means “has type”). In this application, these declarations are not required, but they make the schema of the database explicit and easier to grasp. Since the object-oriented schema above is quite different from the original relational schema in Section 1.8.1, the next pair of rules specifies the “bridge” between the two schemes:

?did:Dissertation[cid->?cid, advisor->?aid, title->?ttl, area->?ar,
                  university->?uni, degree->?deg, year->?yr] :-
    dissertation(?did,?cid,?ttl,?deg,?uni,?yr,?ar),
    advised(?aid,?did).
?pid:Person[name->?nm] :- person(?pid,?nm).

We recall that the symbol -> means “has value” and thus the first rule above says that, given a tuple ⟨d, c, t, g, u, y, ar⟩ in table dissertation and a tuple ⟨adv, d⟩


in advised, one can derive that d is an object in class Dissertation such that: its attribute cid has value c, the attribute advisor has a value adv, title has value t, and so on. Note that an attribute can have multiple values and so, if dissertation d was advised by several advisors, the advisor attribute for d will be multi-valued. The next batch of statements contains the actual queries for our running example. It starts with three auxiliary rules that define additional view-methods for classes Dissertation and Person. The first rule, for example, says that if there is a Dissertation-object (denoted by a name-less variable ?) that has a candidate with Id ?P (i.e., the attribute cid has value ?P) and an advisor ?A then that dissertation's advisor is also an advisor for ?P. The next two rules recursively define the method ancAdvisor in class Person.

// Utilities
?P[advisor->?A] :- ?:Dissertation[cid->?P, advisor->?A].
?P[ancAdvisor->?AA] :- ?P:Person[advisor->?AA].
?P[ancAdvisor->?AA] :- ?P:Person[advisor->?[ancAdvisor->?AA]].
// Queries
// Grand-advisors of DSW
q1(?Name) :- ?:Person[name->'David Scott Warren', advisor->?A],
             ?A[advisor->?[name->?Name]].
// Same university as advisor
q2(?Name) :- ?P:Person[name->?Name],
             ?:Dissertation[cid->?P, advisor->?A, university->?U],
             ?:Dissertation[cid->?A, university->?U].
// Different area than advisor
q3(?Name) :- ?P:Person[name->?Name],
             ?:Dissertation[cid->?P, advisor->?A, area->?R1],
             ?:Dissertation[cid->?A, area->?R2],
             ?R1 != ?R2.
// Ancestors of DSW
q4(?Name) :- ?:Person[name->'David Scott Warren', ancAdvisor->?[name->?Name]].
// Count the ancestors of DSW
q5(count{?Name|q4(?Name)}).

The last rule is interesting because it appears in the form of a fact, but is really a shorthand for a rule like q5(?Count) :- ?Count = count{?Name|q4(?Name)}.


but count{...} is an evaluable aggregate function that can be placed directly as an argument to q5. This last query provides a glimpse of how logical and functional styles can be mixed in Flora-2.

1.8.6 Experiments
In this section, we report experimental results on two of the four modern systems we have discussed: XSB and LogicBlox. The systems that we have chosen for experiments are representative of two schools of evaluation, top-down and bottom-up. Each school has its own challenges, and we will illustrate those challenges via experiments, and show techniques to tackle them. An issue pertinent to both schools of evaluation is indexing. During an evaluation, a system needs to retrieve facts for a predicate given some bound arguments. XSB, by default, generates an index on the first argument of each predicate. However, as we will illustrate, that index may not be enough. In such cases, XSB allows the user to specify additional indexing directives. LogicBlox automatically generates relevant indexes, and the user cannot affect the system’s choices. We will illustrate the importance of indexing in XSB via experiments. An issue pertinent only to top-down evaluation is repeated calls to subgoals, as described in Section 1.5.2.1. We will show how the user specifies tabling directives, and the difference in query evaluation time with and without tabling in XSB. An issue pertinent only to bottom-up evaluation is that, in its basic form, it does not take the particular query into account. Many methods exist to transform the rules so that the query is taken into account, most notably the Magic Sets transformation [Bancilhon et al. 1986]. We will use demand transformation instead [Tekle and Liu 2010], since it is a simpler method and its space complexity can be exponentially less (in the number of arguments of predicates) than Magic Sets. We will illustrate the effect of demand transformation via experiments in the LogicBlox system. Another impact on performance for a system is the inclusion of database features such as atomicity and durability. These require various mechanisms, including locks and writes to disk, which impact performance negatively. LogicBlox has various such features, but XSB does not, and the experiment results should be evaluated in this light as well. Optimization of Datalog queries by transforming rules and queries has been widely studied. Such optimizations include specialization and recursion conversion [Tekle et al. 2008]; the latter is illustrated for relevant queries for both systems.
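To convey what taking the query into account buys, the following Python sketch (an illustration of the idea over hypothetical IDs, not the output of the actual demand transformation) restricts bottom-up computation of adv to the advisee IDs that the query actually demands, starting from David Scott Warren and propagating one level for Query 1.

# Hypothetical EDB fragments (the IDs are made up).
advised = {("av", "d1"), ("mk", "d2"), ("x1", "d3")}        # advised(AID, DID)
dissertation = {("d1", "dsw"), ("d2", "av"), ("d3", "zz")}  # (DID, CID) projection
author = dict(dissertation)                                  # DID -> CID

def adv_for(demanded):
    # adv(X, Y) restricted to Y in the demanded set of candidate IDs
    return {(a, author[d]) for (a, d) in advised if author.get(d) in demanded}

level1 = adv_for({"dsw"})                    # DSW's advisors
level2 = adv_for({a for (a, _) in level1})   # their advisors: the grand-advisors
print({a for (a, _) in level2})              # {'mk'} for this toy data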


We now show our experiments for each query. We performed the experiments on a 2.8 GHz Intel Core i5 with 8 GB RAM, using XSB 3.7 and LogicBlox 4.3.10. We performed each experiment 10 times and took the average times of the runs. Loading Input Data. Input data load times are important to the viability of a system,

but it is a task that need be performed only once, and does not change with respect to a given query. For this reason, we separate loading input data from query answering in our experiments. In order to load the input data into XSB, we run the following query: :- load_dync(data).

and in order to load the input data into LogicBlox, we install the definitions of the input predicates, and then run import scripts via its TDX mechanism. Setting up the environment and loading of input data as described above takes 3.91 s in XSB, and 14.9 s in LogicBlox. Query 1.

The first query asks for the grand-advisors of David Scott Warren. XSB.

Recall the rules and query in XSB:

author(D, C) :- dissertation(D, C, _, _, _, _, _).
adv(X, Y) :- advised(X, DID), author(DID, Y).
% Grand-advisors of DSW
q1(N) :- person(D, 'David Scott Warren'), adv(X, D), adv(Y, X), person(Y, N).
% Enumerate all results to Query 1
enumallq1 :- q1(_), fail; true.
?- enumallq1.

Note that a call to q1 will be made with its argument as a free variable, and therefore the following calls will be made in the body of the rule for q1: (i) a call to person with the second argument bound due to the first subgoal, since that argument is a constant; (ii) a call to adv with the second argument bound due to the second subgoal, since D will be bound by the first subgoal; (iii) another call to adv with the second argument bound due to the third subgoal, since X will be bound by the second subgoal; and (iv) another call to person with the first argument bound due to the final subgoal, since Y will be bound by the third subgoal. The rules and query may be analyzed for each relevant predicate to find binding patterns for predicates, as shown above. For optimal performance, it is critical


to have indexes corresponding to each binding pattern for the IDB predicates. As noted, XSB provides indexes on the first argument by default. For each non-default index, we write the indexing directives as follows: :- index(person/2, [2, 1]).

The query is answered in 261 ms without the index, and in 238 ms with the index, 9.6% faster. The effect is not so dramatic, since only one subgoal binds the second argument of person. We will see more dramatic effects in other queries. LogicBlox. Recall that in LogicBlox, we do not have control over the indexing

mechanism. Running the first query takes 1400 ms. However, LogicBlox does not take the query into account, and therefore will infer facts for the adv predicate for every possible pair, but the only relevant ones are David Scott Warren’s advisors, and their advisors in turn. To avoid the extra computation, we can perform demand transformation, which yields the following rules: adv(x, y)