Numerical Algorithms for Personalized Search in Self-organizing Information Networks

This book lays out the theoretical groundwork for personalized search and reputation management, both on the Web and in peer-to-peer networks.




Numerical Algorithms for Personalized Search in Self-organizing Information Networks

Sep Kamvar

PRINCETON UNIVERSITY PRESS PRINCETON AND OXFORD

Copyright © 2010 by Princeton University Press
Published by Princeton University Press, 41 William Street, Princeton, New Jersey 08540
In the United Kingdom: Princeton University Press, 6 Oxford Street, Woodstock, Oxfordshire OX20 1TW
All Rights Reserved

Library of Congress Cataloging-in-Publication Data
Kamvar, Sep, 1977–
Numerical algorithms for personalized search in self-organizing information networks / Sep Kamvar.
p. cm.
Includes bibliographical references.
ISBN 978-0-691-14503-7 (hardcover : alk. paper)
1. Database searching–Mathematics. 2. Information networks–Mathematics. 3. Content analysis (Communication)–Mathematics. 4. Self-organizing systems–Data processing. 5. Algorithms. 6. Internet searching–Mathematics. I. Title.
ZA4460.K36 2010
025.524–dc22   2010014915
British Library Cataloging-in-Publication Data is available
Printed on acid-free paper. ∞
press.princeton.edu
Typeset by S R Nova Pvt Ltd, Bangalore, India
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1

Contents

Tables
Figures
Acknowledgments

Chapter 1 Introduction
    1.1 World Wide Web
    1.2 P2P Networks
    1.3 Contributions

PART I WORLD WIDE WEB

Chapter 2 PageRank
    2.1 PageRank Basics
    2.2 Notation and Mathematical Preliminaries
    2.3 Power Method
        2.3.1 Formulation
        2.3.2 Operation Count
        2.3.3 Convergence
    2.4 Experimental Setup
    2.5 Related Work
        2.5.1 Fast Eigenvector Computation
        2.5.2 PageRank

Chapter 3 The Second Eigenvalue of the Google Matrix
    3.1 Introduction
    3.2 Theorems
    3.3 Proof of Theorem 1
    3.4 Proof of Theorem 2
    3.5 Implications
    3.6 Theorems Used

Chapter 4 The Condition Number of the PageRank Problem
    4.1 Theorem 6
    4.2 Proof of Theorem 6
    4.3 Implications

Chapter 5 Extrapolation Algorithms
    5.1 Introduction
    5.2 Aitken Extrapolation
        5.2.1 Formulation
        5.2.2 Operation Count
        5.2.3 Experimental Results
        5.2.4 Discussion
    5.3 Quadratic Extrapolation
        5.3.1 Formulation
        5.3.2 Operation Count
        5.3.3 Experimental Results
        5.3.4 Discussion
    5.4 Power Extrapolation
        5.4.1 Simple Power Extrapolation
        5.4.2 A2 Extrapolation
        5.4.3 Ad Extrapolation
    5.5 Measures of Convergence

Chapter 6 Adaptive PageRank
    6.1 Introduction
    6.2 Distribution of Convergence Rates
    6.3 Adaptive PageRank Algorithm
        6.3.1 Algorithm Intuition
        6.3.2 Filter-based Adaptive PageRank
    6.4 Experimental Results
    6.5 Extensions
        6.5.1 Further Reducing Redundant Computation
        6.5.2 Using the Matrix Ordering from the Previous Computation
    6.6 Discussion

Chapter 7 BlockRank
    7.1 Block Structure of the Web
        7.1.1 Block Sizes
        7.1.2 The GeoCities Effect
    7.2 BlockRank Algorithm
        7.2.1 Overview of BlockRank Algorithm
        7.2.2 Computing Local PageRanks
        7.2.3 Estimating the Relative Importance of Each Block
        7.2.4 Approximating Global PageRank Using Local PageRank and BlockRank
        7.2.5 Using This Estimate as a Start Vector
    7.3 Advantages of BlockRank
    7.4 Experimental Results
    7.5 Discussion
    7.6 Personalized PageRank
        7.6.1 Inducing Random Jump Probabilities over Pages
        7.6.2 Using "Better" Local PageRanks
        7.6.3 Experiments
        7.6.4 Topic-Sensitive PageRank
        7.6.5 Pure BlockRank

PART II P2P NETWORKS

Chapter 8 Query-Cycle Simulator
    8.1 Challenges in Empirical Evaluation of P2P Algorithms
    8.2 The Query-Cycle Model
    8.3 Basic Properties
        8.3.1 Network Topology
        8.3.2 Joining the Network
        8.3.3 Query Propagation
    8.4 Peer-Level Properties
    8.5 Content Distribution Model
        8.5.1 Data Volume
        8.5.2 Content Type
    8.6 Peer Behavior Model
        8.6.1 Uptime and Session Duration
        8.6.2 Query Activity
        8.6.3 Queries
        8.6.4 Query Responses
        8.6.5 Downloads
    8.7 Network Parameters
        8.7.1 Topology
        8.7.2 Bandwidth
    8.8 Discussion

Chapter 9 EigenTrust
    9.1 Design Considerations
    9.2 Reputation Systems
    9.3 EigenTrust
        9.3.1 Normalizing Local Trust Values
        9.3.2 Aggregating Local Trust Values
        9.3.3 Probabilistic Interpretation
        9.3.4 Basic EigenTrust
        9.3.5 Practical Issues
        9.3.6 Distributed EigenTrust
        9.3.7 Algorithm Complexity
    9.4 Secure EigenTrust
        9.4.1 Algorithm Description
        9.4.2 Discussion
    9.5 Using Global Trust Values
    9.6 Experiments
        9.6.1 Load Distribution in a Trust-based Network
        9.6.2 Threat Models
    9.7 Related Work
    9.8 Discussion

Chapter 10 Adaptive P2P Topologies
    10.1 Introduction
    10.2 Interaction Topologies
    10.3 Adaptive P2P Topologies
        10.3.1 Local Trust Scores
        10.3.2 Protocol
        10.3.3 Practical Issues
    10.4 Empirical Results
        10.4.1 Malicious Peers Move to Fringe
        10.4.2 Freeriders Move to Fringe
        10.4.3 Active Peers Are Rewarded
        10.4.4 Efficient Topology
    10.5 Threat Scenarios
        10.5.1 Threat Model A
        10.5.2 Threat Model B
        10.5.3 Threat Model C
    10.6 Related Work
    10.7 Discussion

Chapter 11 Conclusion

Bibliography

Tables

5.1 Wallclock speedups for Ad Extrapolation, for d ∈ {1, 2, 4, 6, 8}, and Quadratic Extrapolation.
6.1 Statistics about pages in the Stanford.EDU dataset whose convergence times are quick (ti ≤ 15) and pages whose convergence times are long (ti > 15).
7.1 Example illustrating our terminology using the sample url http://cs.stanford.edu/research/.
7.2 Hyperlink statistics on LargeWeb for the full graph (Full: 291M nodes, 1.137B links) and for the graph with dangling nodes removed (DNR: 64.7M nodes, 607M links).
7.3 The "closeness" as measured by average (a) absolute error, and (b) KDist distance of the local PageRank vectors lJ and the global PageRank segments gJ, compared to the closeness of uniform vectors vJ and the global PageRank segments gJ for the Stanford/Berkeley dataset.
7.4 The local PageRank vector lJ for the domain aa.stanford.edu (left) compared to the global PageRank segment gJ corresponding to the same pages. The local PageRank vector has a similar ordering to that of the normalized components of the global PageRank vector. The discrepancy in actual ranks is largely due to the fact that the local PageRank vector does not give enough weight to the root node http://aa.stanford.edu.
7.5 Running times for the individual steps of BlockRank for c = 0.85 in achieving a final residual of …

3.4 PROOF OF THEOREM 2

… then let x_i = y1/k1 − y2/k2. Note that x_i is an eigenvector of P^T with eigenvalue exactly 1 and that x_i is orthogonal to e. From Lemma 1, x_i is an eigenvector of A corresponding to eigenvalue c. Therefore, the eigenvalue λi of A corresponding to eigenvector x_i is λi = c. Therefore, |λ2| ≥ c, and there exists a λi = c. However, from Theorem 1, |λ2| ≤ c. Therefore, λ2 = c and Theorem 2 is proved. (Note that there may be additional eigenvalues with modulus c, such as −c.)
3.5 IMPLICATIONS

Convergence of PageRank. The PageRank algorithm uses the Power Method to compute the principal eigenvector of A. The rate of convergence of the Power Method is given by |λ2/λ1| [30, 72]. Since the Web graph has been empirically shown to contain many irreducible closed subsets [13], Theorem 2 holds for the matrix A used in PageRank. For PageRank, the typical value of c has been given as 0.85; for this value of c, Theorem 2 thus implies that the convergence rate of the Power Method |λ2/λ1| for any Web link matrix A is 0.85. Therefore, the convergence rate of PageRank will be reasonably fast, even as the Web scales.

Stability of PageRank to perturbations in the link-structure. The modulus of the nonprincipal eigenvalues also determines whether the corresponding Markov chain is well-conditioned. As shown by Meyer in [54], the greater the eigengap |λ1| − |λ2|, the more stable the stationary distribution is to perturbations in the Markov chain. Our analysis provides an alternate explanation for the stability of PageRank shown by Ng et al. [55].

Accelerating PageRank computations. Knowing the second eigenvalue of the Web matrix λ2 = c allows for the design of improved algorithms to compute PageRank. This will be discussed further in Chapter 5.

Spam detection. The eigenvectors corresponding to the second eigenvalue λ2 = c are an artifact of certain structures in the Web graph. In particular, each pair of leaf nodes in the SCC graph for the chain P corresponds to an eigenvector of A with eigenvalue c. These leaf nodes in the SCC are those subgraphs in the Web link graph which may have incoming edges, but have no edges to other components. Link spammers often generate such structures in attempts to hoard rank. Analysis of the nonprincipal eigenvectors of A may lead to strategies for combating link spam.

Broader implications. This proof has implications for spectral methods beyond Web search. For example, in the field of peer-to-peer networks, the EigenTrust reputation algorithm discussed in Chapter 9 computes the principal eigenvector of a matrix of the form defined in equation (2.2). This result shows that EigenTrust will converge quickly, minimizing network overhead. In the field of image segmentation, Perona and Freeman [57] present an algorithm that segments an image by thresholding the first eigenvector of the affinity matrix of the image. One may normalize the affinity matrix to be stochastic as in [53] and introduce a regularization parameter as in [56] to define a matrix of the form given in (2.2). The benefit of this is that one can choose the regularization parameter c to be large enough so that the computation of the dominant eigenvector is very fast, allowing the Perona-Freeman algorithm to work for very large scale images.

3.6 THEOREMS USED

This section contains theorems that are proven elsewhere and are used in proving Theorems 1 and 2 of this book.

Theorem 3 (from page 126 of [38]). If P is the transition matrix for a finite Markov chain, then the multiplicity of the eigenvalue 1 is equal to the number of irreducible closed subsets of the chain.

Theorem 4 (from page 4 of [72]). If xi is an eigenvector of A corresponding to the eigenvalue λi, and yj is an eigenvector of A^T corresponding to λj, then xi^T yj = 0 (if λi ≠ λj).

Theorem 5 (from page 82 of [37]). Two distinct states belonging to the same class (irreducible closed subset) have the same period. In other words, the property of having period d is a class property.

Chapter Four
The Condition Number of the PageRank Problem

In the previous chapter, we showed convergence properties of the PageRank problem. In this chapter, we focus on stability. In particular, the following shows that the PageRank problem is well-conditioned for values of c that are not very close to 1.

4.1 THEOREM 6

Theorem 6. Let P be an n × n row-stochastic matrix whose diagonal elements Pii = 0. Let c be a real number such that 0 ≤ c ≤ 1. Let E be the n × n rank-one row-stochastic matrix E = ev^T, where e is the n-vector whose elements are all ei = 1, and v is an n-vector that represents a probability distribution (i.e., a vector whose elements are nonnegative and whose L1 norm is 1). Define the matrix A = [cP + (1 − c)E]^T. The problem Ax = x has condition number κ = (1 + c)/(1 − c).

4.2 PROOF OF THEOREM 6

We prove this case via a series of lemmas.

Lemma 1. E^T x = v.
Proof. By definition, E = ev^T. Therefore, E^T x = v e^T x. From (2.4), e^T x = 1. Therefore, E^T x = v.

Lemma 2. The eigenvalue problem Ax = x can be rewritten as the nonsingular system of equations (I − cP^T)x = (1 − c)v.
Proof. From Ax = x, we can rearrange terms to get (I − A)x = 0. By the definition of A (equation 2.2), [I − (cP + (1 − c)E)^T]x = 0. From Lemma 1, E^T x = v. Therefore, (I − cP^T)x − (1 − c)v = 0. Rearranging terms, we get (I − cP^T)x = (1 − c)v.


Lemma 3. x = (1 − c)(I − cP^T)^−1 v.
Proof. Let M = I − cP^T. Then M^T = I − cP. Since P has zeros on the diagonal and is row-stochastic, and since c < 1, I − cP is strictly diagonally dominant and therefore invertible. Since M^T is invertible, M is also invertible. Therefore, from Lemma 2, we may write x = (1 − c)(I − cP^T)^−1 v.

Lemma 4. ||I − cP^T||1 = 1 + c.
Proof. Since the diagonal elements of cP^T are all zero, ||I − cP^T||1 = ||I||1 + c||P^T||1 = 1 + c||P^T||1. Since P^T is a column-stochastic matrix, ||P^T||1 = 1. Thus, ||I − cP^T||1 = 1 + c.

Lemma 5. ||(I − cP^T)^−1||1 = 1/(1 − c).
Proof. Recall from (2.2) that A = [cP + (1 − c)E]^T, where E = ev^T and v is some n-vector whose elements are nonnegative and sum to 1. Let x(ei) be the n-vector that satisfies the following equations: v = ei, A x(ei) = x(ei), ||x(ei)||1 = 1. From Lemma 2, x = (1 − c)(I − cP^T)^−1 v. Therefore, x(ei) = (1 − c)(I − cP^T)^−1 ei. Taking the norm of both sides gives ||x(ei)||1 = (1 − c)||(I − cP^T)^−1 ei||1. Since ||x(ei)||1 = 1, we have

    ||(I − cP^T)^−1 ei||1 = 1/(1 − c).    (4.1)

Notice that (I − cP^T)^−1 ei gives the ith column of (I − cP^T)^−1. Thus, from (4.1), the L1 norm of the matrix (I − cP^T)^−1 is ||(I − cP^T)^−1||1 = 1/(1 − c).

Lemma 6. The 1-norm condition number of the problem x = (I − cP^T)^−1 v is κ = (1 + c)/(1 − c).
Proof. By definition, the 1-norm condition number κ of the problem y = M^−1 b is given by κ = ||M||1 ||M^−1||1. From Lemmas 4 and 5, this is κ = (1 + c)/(1 − c).

4.3 IMPLICATIONS

The strongest implication of this result has to do with the stability of PageRank. A proof of stability of PageRank is given in [55], but we show a tighter stability bound here. Imagine that the Google matrix A is perturbed slightly, either by modifying the link structure of the Web (by adding or taking away links), or by changing the value of c. Let us call this perturbed matrix Ã = A + B, where B is the "error matrix" describing the change to the Web matrix A. Let x be the PageRank vector corresponding to the Web matrix A, and let x̃ be the vector corresponding to the Web matrix Ã. It is known that, for a linear system of equations,

    ||x − x̃||1 ≤ κ ||B||.

From Theorem 6, we can rewrite this as

    ||x − x̃||1 ≤ [(1 + c)/(1 − c)] ||B||.

What this means is that, for values of c near to 1, PageRank is not stable, and a small change in the link structure may cause a large change in PageRank. However, for smaller values of c such as those likely to be used by Google (0.8 < c < 0.9), PageRank is stable, and a small change in the link structure will cause only a small change in PageRank.

Another implication of this is the accuracy to which PageRank may be computed. Again, for values of c likely to be used by Google, PageRank is a well-conditioned problem, meaning that it may be computed accurately by a stable algorithm. However, for values of c close to 1, PageRank is an ill-conditioned problem, and it cannot be computed to great accuracy by any algorithm.
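The sensitivity described above is easy to see numerically. The following is a minimal Python/NumPy sketch, not part of the book: it builds a small random row-stochastic P (the graph, personalization vector v, and perturbation are illustrative assumptions), solves the linear system of Lemma 2 directly (practical only at this toy scale), and compares the L1 change in the solution for a moderate and a near-1 value of c:

    import numpy as np

    def pagerank_direct(P, c, v):
        """Solve (I - c P^T) x = (1 - c) v directly (Lemma 2); P is row-stochastic."""
        n = P.shape[0]
        return np.linalg.solve(np.eye(n) - c * P.T, (1 - c) * v)

    rng = np.random.default_rng(0)
    n = 50
    P = rng.random((n, n))
    np.fill_diagonal(P, 0.0)              # Theorem 6 assumes a zero diagonal
    P /= P.sum(axis=1, keepdims=True)     # make P row-stochastic
    v = np.full(n, 1.0 / n)               # uniform personalization vector

    # Small row-stochastic perturbation of the "link structure".
    E = rng.random((n, n)); np.fill_diagonal(E, 0.0)
    E /= E.sum(axis=1, keepdims=True)
    P_tilde = 0.99 * P + 0.01 * E

    for c in (0.85, 0.99):
        dx = np.abs(pagerank_direct(P, c, v) - pagerank_direct(P_tilde, c, v)).sum()
        print(f"c = {c}: kappa = (1+c)/(1-c) = {(1 + c) / (1 - c):6.1f}, "
              f"L1 change in PageRank = {dx:.2e}")

The same perturbation produces a markedly larger change in the solution for c = 0.99 than for c = 0.85, in line with the condition number κ = (1 + c)/(1 − c).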

Chapter Five Extrapolation Algorithms

5.1 INTRODUCTION The standard PageRank algorithm uses the Power Method to compute successive iterates that converge to the principal eigenvector of the Markov matrix A representing the Web link graph. Since it was shown in the previous chapter that the Power Method converges quickly, one approach to accelerating PageRank would be to directly modify the Power Method to exploit our knowledge about the matrix A. In this chapter, we present several algorithms that accelerate the convergence of PageRank by using successive iterates of the Power Method to estimate the nonprincipal eigenvectors of A, and periodically subtracting these estimates from the current iterate x (k) of the Power method. We call this class of algorithms extrapolation algorithms. We present three extrapolation algorithms in this chapter: Aitken Extrapolation and Quadratic Extrapolation exploit our knowledge that the first eigenvalue λ1 (A) = 1, and Power Extrapolation exploits our knowledge from the previous chapter that λ2 (A) = c. Empirically, we show that Quadratic Extrapolation and Power Extrapolation speed up PageRank computation by 25–300% on a Web graph of 80 million nodes, with minimal overhead. This contribution is useful to the PageRank community and the numerical linear algebra community in general, as these are fast methods for determining the dominant eigenvector of a Markov matrix that is too large for standard fast methods to be practical.
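For reference, the single power iteration that the text refers to as Algorithm 1 can be sketched in a few lines. The following Python/NumPy fragment is only an illustration (the toy matrix, its size, and the tolerance are assumptions, not the book's experimental setup); it forms A = [cP + (1 − c)ev^T]^T as in equation (2.2) and iterates x ← Ax with the L1 residual as the stopping test:

    import numpy as np

    def power_method(A, v, tol=1e-8, max_iter=1000):
        """Basic power iteration x^(k) = A x^(k-1) with an L1 residual test."""
        x = v.copy()
        for k in range(1, max_iter + 1):
            x_new = A @ x
            if np.abs(x_new - x).sum() < tol:
                return x_new, k
            x = x_new
        return x, max_iter

    rng = np.random.default_rng(1)
    n, c = 100, 0.85
    P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # row-stochastic toy "Web"
    v = np.full(n, 1.0 / n)                                     # teleport / start vector
    A = (c * P + (1 - c) * np.outer(np.ones(n), v)).T           # equation (2.2)

    x, iters = power_method(A, v)
    print(iters, x.sum())   # x remains a probability distribution

The extrapolation methods below all modify this loop: they periodically combine a few successive iterates to cancel error along the leading nonprincipal eigenvectors.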

5.2 AITKEN EXTRAPOLATION 5.2.1 Formulation We begin by introducing Aitken Extrapolation, which we develop as follows. We assume that the iterate x(k−2) can be expressed as a linear combination of the first two eigenvectors. This assumption allows us to solve for the principal eigenvector u1 in closed form using the successive iterates x(k−2) , . . . , x(k) . Of course, x(k−2) can only be approximated as a linear combination of the first two eigenvectors, so the u1 that we compute is only an estimate of the true u1 . However, it can be seen from Section 2.3.1 that this approximation becomes increasingly accurate as k becomes larger.


We begin our formulation of Aitken Extrapolation by assuming that x(k−2) can be expressed as a linear combination of the first two eigenvectors:

    x(k−2) = u1 + α2 u2.    (5.1)

Since the first eigenvalue λ1 of a Markov matrix is 1, we can write the next two iterates as

    x(k−1) = A x(k−2) = u1 + α2 λ2 u2,    (5.2)
    x(k) = A x(k−1) = u1 + α2 λ2^2 u2.    (5.3)

Now, let us define

    gi = (xi(k−1) − xi(k−2))^2,    (5.4)
    hi = xi(k) − 2 xi(k−1) + xi(k−2),    (5.5)

where xi represents the ith component of the vector x. Doing simple algebra using (5.2) and (5.3) gives

    gi = α2^2 (λ2 − 1)^2 (u2)i^2,    (5.6)
    hi = α2 (λ2 − 1)^2 (u2)i.    (5.7)

Now, let us define fi as the quotient gi/hi:

    fi = gi/hi = α2^2 (λ2 − 1)^2 (u2)i^2 / [α2 (λ2 − 1)^2 (u2)i]    (5.8)
       = α2 (u2)i.    (5.9)

Therefore,

    f = α2 u2.    (5.10)

Hence, from (5.1), we have a closed-form solution for u1:

    u1 = x(k−2) − α2 u2 = x(k−2) − f.    (5.11)

However, since this solution is based on the assumption that x(k−2) can be written as a linear combination of u1 and u2, (5.11) gives only an approximation to u1. Algorithms 3 and 4 show how to use Aitken Extrapolation in conjunction with the Power Method to get consistently better estimates of u1.

Aitken Extrapolation is equivalent to applying the well-known Aitken Δ² method for accelerating linearly convergent sequences [4] to each component of the iterate x(k−2). What is novel here is this derivation of Aitken acceleration, and the proof that Aitken acceleration computes the principal eigenvector of a Markov matrix in one step under the assumption that the power-iteration estimate x(k−2) can be expressed as a linear combination of the first two eigenvectors.

As a sidenote, let us briefly develop a related method. Rather than using (5.4), let us define gi alternatively as

    gi = (xi(k−1) − xi(k−2))(xi(k) − xi(k−1)) = α2^2 λ2 (λ2 − 1)^2 (u2)i^2.

We define h as in (5.5), and fi now becomes

    fi = gi/hi = α2^2 λ2 (λ2 − 1)^2 (u2)i^2 / [α2 (λ2 − 1)^2 (u2)i] = α2 λ2 (u2)i.


function x∗ = Aitken(x(k−2), x(k−1), x(k)) {
    for i = 1 : n do
        gi = (xi(k−1) − xi(k−2))^2;
        hi = xi(k) − 2 xi(k−1) + xi(k−2);
        xi∗ = xi(k) − gi/hi;
    end
}

Algorithm 3: Aitken Extrapolation

function x(n) = AitkenPowerMethod() {
    x(0) = v;
    k = 1;
    repeat
        x(k) = A x(k−1);
        δ = ||x(k) − x(k−1)||1;
        periodically, x(k) = Aitken(x(k−2), x(k−1), x(k));
        k = k + 1;
    until δ < ε;
}

Algorithm 4: Power Method with Aitken Extrapolation

By (5.2), u1 = x (k−1) − α2 λ2 u2 = x(k−1) − f. Again, this is an approximation to u1 , since it is based on the assumption that x(k−2) can be expressed as a linear combination of u1 and u2 . What is interesting here is that this is equivalent to performing a second-order epsilon acceleration algorithm [73] on each component of the iterate x(k−2) . For this reason, we call this algorithm Epsilon Extrapolation. In experiments, Epsilon Extrapolation has performed similarly to Aitken Extrapolation, so we will focus on Aitken Extrapolation in the next sections. 5.2.2 Operation Count In order for an extrapolation method such as Aitken Extrapolation or Epsilon Extrapolation to be useful, the overhead should be minimal. By overhead, we mean any costs in addition to the cost of applying Algorithm 1 to generate iterates. It is clear from inspection that the operation count of the loop in Algorithm 3 is O(n), where n is the number of pages on the Web. The operation count of one extrapolation step is less than the operation count of a single iteration of the Power Method, and since Aitken Extrapolation may be applied only periodically, we say that Aitken Extrapolation has minimal overhead. In our implementation, the additional cost of each application of Aitken Extrapolation was negligible—about 1% of the cost of a single iteration of the Power Method (i.e., 1% of the cost of Algorithm 1).
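As a concrete (non-authoritative) illustration of Algorithms 3 and 4, the following Python/NumPy sketch applies the component-wise Aitken step to three successive power iterates; the guard against near-zero denominators is an implementation detail assumed here, not spelled out in the book's pseudocode:

    import numpy as np

    def aitken(x_km2, x_km1, x_k):
        """Aitken Extrapolation (Algorithm 3) on iterates x^(k-2), x^(k-1), x^(k)."""
        g = (x_km1 - x_km2) ** 2
        h = x_k - 2 * x_km1 + x_km2
        out = x_k.copy()
        safe = np.abs(h) > 1e-12          # skip components that have already converged
        out[safe] = x_k[safe] - g[safe] / h[safe]
        return out

    def aitken_power_method(A, v, tol=1e-8, apply_at=10, max_iter=500):
        """Power Method with one Aitken step at iteration apply_at (cf. Algorithm 4)."""
        prev2, prev1, x = None, None, v.copy()
        for k in range(1, max_iter + 1):
            prev2, prev1 = prev1, x
            x = A @ prev1
            delta = np.abs(x - prev1).sum()
            if k == apply_at and prev2 is not None:
                x = aitken(prev2, prev1, x)   # extrapolated iterate replaces x^(k)
            if delta < tol:
                break
        return x, k

Applying the step once, around iteration 10 as in the experiment below, mirrors the usage reported in the next subsection; applying it too frequently risks the residual spikes discussed there.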

Figure 5.1 Comparison of convergence rate for unaccelerated Power Method and Aitken Extrapolation on the Stanford.EDU dataset, for c = 0.99. Extrapolation was applied at the 10th iteration. (L1 residual versus number of iterations.)

5.2.3 Experimental Results In Figure 5.1, we show the convergence of the Power Method with Aitken Extrapolation applied once at the 10th iteration, compared to the convergence of the unaccelerated Power Method for the Stanford.EDU dataset. The x-axis denotes the number of times a multiplication A x occurred, that is, the number of times Algorithm 1 was needed. Note that there is a spike at the acceleration step, but speedup occurs nevertheless. This spike is caused by the poor approximation for u2 . The speedup likely occurs because, while the extrapolation step introduces some error in the components in the direction of each eigenvector, it is likely to actually reduce the error in the component in the direction of the second eigenvector, which is slow to go away by the standard Power Method. Multiple applications of Aitken Extrapolation are generally not more useful than only one. This is because, by the time the residual settles down from the first application, a second extrapolation approximation does not give a very different estimate from the current iterate itself. For c = 0.99, Aitken Extrapolation takes 38% less time to reach an iterate with a residual of 0.01. However, after this initial speedup, the convergence rate for Aitken slows down, so that to reach an iterate with a residual of 0.002, the time savings drops to 13%. For lower values of c, Aitken provides much less benefit. Since there is a spike in the residual graph, if Aitken Extrapolation is applied too often, the power iterations will not converge. In experiments, Epsilon Extrapolation performed similarly to Aitken Extrapolation. 5.2.4 Discussion The methods presented thus far are very different from standard fast eigensolvers, which generally rely strongly on matrix factorizations or matrix inversions.


Standard fast eigensolvers do not work well for the PageRank problem, since the Web hyperlink matrix is so large and sparse. For problems where the matrix is small enough for an efficient inversion, standard eigensolvers such as inverse iteration are likely to be faster than these methods. The Aitken and Epsilon Extrapolation methods take advantage of the fact that the first eigenvalue of the Markov hyperlink matrix is 1 to find an approximation to the principal eigenvector. In the next section, we present Quadratic Extrapolation, which assumes the iterate can be expressed as a linear combination of the first three eigenvectors, and solves for u1 in closed form under this assumption. As we shall soon discuss, the Quadratic Extrapolation step is simply a linear combination of successive iterates, and thus does not produce spikes in the residual.

5.3 QUADRATIC EXTRAPOLATION

5.3.1 Formulation

We develop the Quadratic Extrapolation algorithm as follows. We assume that the Markov matrix A has only 3 eigenvectors, and that the iterate x(k−3) can be expressed as a linear combination of these 3 eigenvectors. These assumptions allow us to solve for the principal eigenvector u1 in closed form using the successive iterates x(k−3), . . . , x(k). Of course, A has more than 3 eigenvectors, and x(k−3) can only be approximated as a linear combination of the first 3 eigenvectors. Therefore, the u1 that we compute in this algorithm is only an estimate for the true u1. We show empirically that this estimate is a better estimate to u1 than the iterate x(k−3), and that our estimate becomes closer to the true value of u1 as k becomes larger. In Section 5.3.3 we show that by periodically applying Quadratic Extrapolation to the successive iterates computed in PageRank, for values of c close to 1, we can speed up the convergence of PageRank by a factor of over 3.

We begin our formulation with the assumption stated at the beginning of this section:

    x(k−3) = u1 + α2 u2 + α3 u3.    (5.12)

We then define the successive iterates

    x(k−2) = A x(k−3),    (5.13)
    x(k−1) = A x(k−2),    (5.14)
    x(k) = A x(k−1).    (5.15)

Since we assume A has 3 eigenvectors, the characteristic polynomial pA(λ) is given by

    pA(λ) = γ0 + γ1 λ + γ2 λ^2 + γ3 λ^3.    (5.16)

A is a Markov matrix, so we know that the first eigenvalue λ1 = 1. The eigenvalues of A are also the zeros of the characteristic polynomial pA(λ). Therefore,

    pA(1) = 0 ⇒ γ0 + γ1 + γ2 + γ3 = 0.    (5.17)

The Cayley-Hamilton theorem states that any matrix A satisfies its own characteristic polynomial, pA(A) = 0 [30]. Therefore, by the Cayley-Hamilton theorem, for any vector z in R^n,

    pA(A)z = 0 ⇒ [γ0 I + γ1 A + γ2 A^2 + γ3 A^3] z = 0.    (5.18)

Letting z = x(k−3), we have

    [γ0 I + γ1 A + γ2 A^2 + γ3 A^3] x(k−3) = 0.    (5.19)

From (5.13)–(5.15),

    γ0 x(k−3) + γ1 x(k−2) + γ2 x(k−1) + γ3 x(k) = 0.    (5.20)

From (5.17),

    x(k−3)(−γ1 − γ2 − γ3) + γ1 x(k−2) + γ2 x(k−1) + γ3 x(k) = 0.    (5.21)

We may rewrite this as

    (x(k−2) − x(k−3))γ1 + (x(k−1) − x(k−3))γ2 + (x(k) − x(k−3))γ3 = 0.    (5.22)

Let us make the following definitions:

    y(k−2) = x(k−2) − x(k−3),    (5.23)
    y(k−1) = x(k−1) − x(k−3),    (5.24)
    y(k) = x(k) − x(k−3).    (5.25)

We can now write (5.22) in matrix notation:

    ( y(k−2)  y(k−1)  y(k) ) γ = 0.    (5.26)

We now wish to solve for γ. Since we are not interested in the trivial solution γ = 0, we constrain the leading term of the characteristic polynomial γ3:

    γ3 = 1.    (5.27)

We may do this because constraining a single coefficient of the polynomial does not affect the zeros (i.e., (5.27) fixes a scaling for γ). (5.26) is therefore written

    ( y(k−2)  y(k−1) ) [γ1  γ2]^T = −y(k).    (5.28)

This is an overdetermined system, so we solve the corresponding least-squares problem,

    [γ1  γ2]^T = −Y^+ y(k),    (5.29)

where Y^+ is the pseudoinverse of the matrix Y = ( y(k−2)  y(k−1) ). Now, (5.17), (5.27), and (5.29) completely determine the coefficients of the characteristic polynomial pA(λ) (5.16).


function x∗ = QuadraticExtrapolation(x(k−3), . . . , x(k)) {
    for j = k − 2 : k do
        y(j) = x(j) − x(k−3);
    end
    Y = ( y(k−2)  y(k−1) );
    γ3 = 1;
    [γ1  γ2]^T = −Y^+ y(k);
    γ0 = −(γ1 + γ2 + γ3);
    β0 = γ1 + γ2 + γ3;
    β1 = γ2 + γ3;
    β2 = γ3;
    x∗ = β0 x(k−2) + β1 x(k−1) + β2 x(k);
}

Algorithm 5: Quadratic Extrapolation

We may now divide pA(λ) by λ − 1 to get the polynomial qA(λ), whose roots are λ2 and λ3, the second two eigenvalues of A:

    qA(λ) = (γ0 + γ1 λ + γ2 λ^2 + γ3 λ^3)/(λ − 1) = β0 + β1 λ + β2 λ^2.    (5.30)

Simple polynomial division gives the following values for β0, β1, and β2:

    β0 = γ1 + γ2 + γ3,    (5.31)
    β1 = γ2 + γ3,    (5.32)
    β2 = γ3.    (5.33)

Again, by the Cayley-Hamilton theorem, if z is any vector in R^n,

    qA(A)z = u1,    (5.34)

where u1 is the eigenvector of A corresponding to eigenvalue 1 (the principal eigenvector). Letting z = x(k−2), we have

    u1 = qA(A) x(k−2) = [β0 I + β1 A + β2 A^2] x(k−2).    (5.35)

From (5.13)–(5.15), we get a closed form solution for u1:

    u1 = β0 x(k−2) + β1 x(k−1) + β2 x(k).    (5.36)

However, since this solution is based on the assumption that A has only 3 eigenvectors, (5.36) gives only an approximation to u1. Algorithms 5 and 6 show how to use Quadratic Extrapolation in conjunction with the Power Method to get consistently better estimates of u1.


function x(n) = QuadraticPowerMethod() {
    x(0) = v;
    k = 1;
    repeat
        x(k) = A x(k−1);
        δ = ||x(k) − x(k−1)||1;
        periodically, x(k) = QuadraticExtrapolation(x(k−3), . . . , x(k));
        k = k + 1;
    until δ < ε;
}

Algorithm 6: Power Method with Quadratic Extrapolation

1. Compute the reduced QR factorization Y = QR using 2 steps of Gram-Schmidt.
2. Compute the vector −Q^T y(k).
3. Solve the upper triangular system R [γ1  γ2]^T = −Q^T y(k) for [γ1  γ2]^T using back substitution.

Algorithm 7: Using Gram-Schmidt to solve for γ1 and γ2
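The following Python/NumPy sketch mirrors Algorithm 5, with two assumptions of our own: the least-squares solve uses a library routine rather than the two-step Gram-Schmidt of Algorithm 7, and the result is renormalized to unit L1 norm as a practical safeguard that the book's pseudocode does not include:

    import numpy as np

    def quadratic_extrapolation(x_km3, x_km2, x_km1, x_k):
        """Quadratic Extrapolation (cf. Algorithm 5) from iterates x^(k-3) ... x^(k)."""
        y_km2 = x_km2 - x_km3              # y^(k-2), equation (5.23)
        y_km1 = x_km1 - x_km3              # y^(k-1), equation (5.24)
        y_k   = x_k   - x_km3              # y^(k),   equation (5.25)
        Y = np.column_stack([y_km2, y_km1])
        gamma3 = 1.0                        # equation (5.27)
        sol, *_ = np.linalg.lstsq(Y, -y_k, rcond=None)   # equation (5.29)
        gamma1, gamma2 = sol
        beta0 = gamma1 + gamma2 + gamma3    # equations (5.31)-(5.33)
        beta1 = gamma2 + gamma3
        beta2 = gamma3
        u1 = beta0 * x_km2 + beta1 * x_km1 + beta2 * x_k   # equation (5.36)
        return u1 / np.abs(u1).sum()        # renormalize (our addition)

In a driver loop such as Algorithm 6, this function would be called periodically on the last four stored iterates, and the returned vector used in place of the current iterate.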

5.3.2 Operation Count

The overhead in performing the extrapolation shown in Algorithm 5 comes primarily from the least-squares computation of γ1 and γ2:

    [γ1  γ2]^T = −Y^+ y(k).

It is clear that the other steps in this algorithm are either O(1) or O(n) operations. Since Y is an n × 2 matrix, we can do the least-squares solution cheaply in just 2 iterations of the Gram-Schmidt algorithm [68]. Therefore, γ1 and γ2 can be computed in O(n) operations. While a presentation of Gram-Schmidt is outside the scope of this book, we show in Algorithm 7 how to apply Gram-Schmidt to solve for [γ1 γ2]^T in O(n) operations.

Since the extrapolation step is on the order of a single iteration of the Power Method, and since Quadratic Extrapolation is applied only periodically during the Power Method, we say that Quadratic Extrapolation has low overhead. In our experimental setup, the overhead of a single application of Quadratic Extrapolation is half the cost of a standard power iteration (i.e., half the cost of Algorithm 1). This number includes the cost of storing on disk the intermediate data required by Quadratic Extrapolation (such as the previous iterates), since they may not fit in main memory.

5.3.3 Experimental Results

Of the algorithms we have discussed for accelerating the convergence of PageRank, Quadratic Extrapolation performs the best empirically. In particular, Quadratic

Figure 5.2 Comparison of convergence rates for Power Method and Quadratic Extrapolation on LargeWeb for c = 0.90. Quadratic Extrapolation was applied the first 5 times that 3 successive power iterates were available.

Extrapolation considerably improves convergence relative to the Power Method when the damping factor c is close to 1. We measured the performance of Quadratic Extrapolation under various scenarios on the LargeWeb dataset. Figure 5.2 shows the rates of convergence when c = 0.90; after factoring in overhead, Quadratic Extrapolation reduces the time needed to reach a residual of 0.001 by 23%. (The time savings we give factor in the overhead of applying extrapolation, and represent wallclock time savings.) Figure 5.3 shows the rates of convergence when c = 0.95; in this case, Quadratic Extrapolation speeds up convergence more significantly, saving 31% in the time needed to reach a residual of 0.001. Finally, in the case where c = 0.99, the speedup is more dramatic. Figure 5.4 shows the rates of convergence of the Power Method and Quadratic Extrapolation for c = 0.99. Because the Power Method is so slow to converge in this case, we plot the curves until a residual of 0.01 is reached. The use of extrapolation saves 69% in time needed to reach a residual of 0.01; that is, the unaccelerated Power Method took more than three times as long as the Quadratic Extrapolation method to reach the desired residual. The wallclock times for these scenarios are summarized in Figure 5.5.

Figure 5.6 shows the convergence for the Power Method, Aitken Extrapolation, and Quadratic Extrapolation on the Stanford.EDU dataset; each method was carried out to 200 iterations. To reach a residual of 0.01, Quadratic Extrapolation saved 59% in time over the Power Method, compared to a 38% savings for Aitken Extrapolation.

An important observation about Quadratic Extrapolation is that it does not necessarily need to be applied very often to achieve maximum benefit. By contracting the error in the current iterate along the direction of the second and third

Figure 5.3 Comparison of convergence rates for Power Method and Quadratic Extrapolation on LargeWeb for c = 0.95. Quadratic Extrapolation was applied 5 times.

Figure 5.4 Comparison of convergence rates for Power Method and Quadratic Extrapolation on LargeWeb when c = 0.99. Quadratic Extrapolation was applied all 11 times possible.

eigenvectors, Quadratic Extrapolation actually enhances the convergence of future applications of the standard Power Method. The Power Method, as discussed previously, is very effective in annihilating error components of the iterate in directions along eigenvectors with small eigenvalues. By subtracting approximations to the second and third eigenvectors, Quadratic Extrapolation leaves error components primarily along the smaller eigenvectors, which the Power Method is better equipped to eliminate.

Figure 5.5 Comparison of wallclock times taken by Power Method and Quadratic Extrapolation on LargeWeb for c = {0.90, 0.95, 0.99}. For c = {0.90, 0.95}, the residual tolerance ε was set to 0.001, and for c = 0.99, it was set to 0.01. (Wallclock time in minutes versus value of c.)

Figure 5.6 Comparison of convergence rates for Power Method, Aitken Extrapolation, and Quadratic Extrapolation on the Stanford.EDU dataset for c = 0.99. Aitken Extrapolation was applied at the 10th iteration, Quadratic Extrapolation was applied every 15th iteration. Quadratic Extrapolation performs the best by a considerable degree. Aitken suffers from a large spike in the residual when first applied.

Figure 5.7 Comparison of convergence rates for Quadratic Extrapolation on LargeWeb for c = 0.95, under two scenarios: Extrapolation was applied the first 5 possible times in one case, and all 14 possible times in the other. Applying it only 5 times achieves nearly the same benefit in this case.

For instance, in Figure 5.7, we plot the convergence when Quadratic Extrapolation is applied 5 times compared with when it is applied as often as possible (in this case, 14 times), to achieve a residual of 0.001. Note that the additional applications of Quadratic Extrapolation do not lead to much further improvement. In fact, once we factor in the 0.5 iteration-cost of each application of Quadratic Extrapolation, the case where it was applied 5 times ends up being faster. 5.3.4 Discussion Like Aitken and Epsilon Extrapolation, Quadratic Extrapolation makes the assumption that an iterate can be expressed as a linear combination of a subset of the eigenvectors of A in order to find an approximation to the principal eigenvector of A. In Aitken and Epsilon Extrapolation, we assume that x(k−2) can be written as a linear combination of the first two eigenvectors, and in Quadratic Extrapolation, we assume that x(k−3) can be written as a linear combination of the first three eigenvectors. Since the assumption made in Quadratic Extrapolation is closer to reality, the resulting approximations are closer to the true value of the principal eigenvector of A. While Aitken and Epsilon Extrapolation are logical extensions of existing acceleration algorithms, Quadratic Extrapolation is completely novel. Furthermore, all these algorithms are general purpose. That is, they can be used to compute the principal eigenvector of any large, sparse Markov matrix, not just the Web graph. They should be useful in any situation where the size and sparsity of the matrix is such that a QR factorization is prohibitively expensive.


Both Aitken and Quadratic Extrapolation assume that none of the nonprincipal eigenvalues of the hyperlink matrix are known. However, in the previous chapter, we showed that the modulus of the second eigenvalue of A is given by the damping factor c. Note that the Web graph can have many eigenvalues with modulus c (e.g., one of c, −c, ci, and −ci). In the next sections, we will discuss Power Extrapolation, which exploits known eigenvalues of A to accelerate PageRank.

5.4 POWER EXTRAPOLATION

5.4.1 Simple Power Extrapolation

5.4.1.1 Formulation

As in the previous sections, the simplest extrapolation rule assumes that the iterate x(k−1) can be expressed as a linear combination of the eigenvectors u1 and u2, where u2 has eigenvalue c:

    x(k−1) = u1 + α2 u2.    (5.37)

Now consider the current iterate x(k); because the Power Method generates iterates by successive multiplication by A, we can write x(k) as

    x(k) = A x(k−1) = A(u1 + α2 u2) = u1 + α2 λ2 u2.    (5.38)

Plugging in λ2 = c, we see that

    x(k) = u1 + α2 c u2.    (5.39)

This allows us to solve for u1 in closed form:

    u1 = (x(k) − c x(k−1)) / (1 − c).    (5.40)

5.4.1.2 Results and Discussion Figure 5.8 shows the convergence of this simple power extrapolation algorithm (which we shall call Simple Extrapolation) and the standard Power Method, where there was one application of Simple Extrapolation at iteration 3 of the Power Method. Simple Extrapolation is not effective, as the assumption that c is the only eigenvalue of modulus c is inaccurate. In fact, by doubling the error in the eigenspace corresponding to eigenvalue −c, this extrapolation technique slows down the convergence of the Power Method. 5.4.2 A2 Extrapolation 5.4.2.1 Formulation The next extrapolation rule assumes that the iterate x(k−2) can be expressed as a linear combination of the eigenvectors u1 , u2 , and u3 , where u2 has eigenvalue c and u3 has eigenvalue −c: x(k−2) = u1 + α2 u2 + α3 u3 .

(5.41)

Figure 5.8 Comparison of convergence rates for Power Method and Simple Extrapolation on LargeWeb for c = 0.85.

Now consider the current iterate x(k); because the Power Method generates iterates by successive multiplication by A, we can write x(k) as

    x(k) = A^2 x(k−2) = A^2 (u1 + α2 u2 + α3 u3) = u1 + α2 λ2^2 u2 + α3 λ3^2 u3.    (5.42)

Plugging in λ2 = c and λ3 = −c, we see that

    x(k) = u1 + c^2 (α2 u2 + α3 u3).    (5.43)

This allows us to solve for u1 in closed form:

    u1 = (x(k) − c^2 x(k−2)) / (1 − c^2).    (5.44)

A2 Extrapolation eliminates error along the eigenspaces corresponding to eigenvalues of c and −c.

5.4.2.2 Results and Discussion Figure 5.9 shows the convergence of A2 extrapolated PageRank and the standard Power Method where A2 Extrapolation was applied once at iteration 4. Empirically, A2 extrapolation speeds up the convergence of the Power Method by 18%. The initial effect of the application increases the residual, but by correctly subtracting much of the largest non-principal eigenvectors, the convergence upon further iterations of the Power Method is sped up.

Figure 5.9 Comparison of convergence rates for Power Method and A2 Extrapolation on LargeWeb for c = 0.85.

5.4.3 Ad Extrapolation

5.4.3.1 Formulation

The previous extrapolation rule made use of the fact that (−c)^2 = c^2. We can generalize that derivation to the case where the eigenvalues of modulus c are given by c·di, where {di} are the dth roots of unity. From Theorem 2.1 of [35] and Theorem 1 given in the Appendix, it follows that these eigenvalues arise from leaf nodes in the strongly connected component (SCC) graph of the Web that are cycles of length d. Because we know empirically that the Web has such leaf nodes in the SCC graph, it is likely that eliminating error along the dimensions of eigenvectors corresponding to these eigenvalues will speed up PageRank.

We make the assumption that x(k−d) can be expressed as a linear combination of the eigenvectors {u1, . . . , ud+1}, where the eigenvalues of {u2, . . . , ud+1} are the dth roots of unity, scaled by c:

    x(k−d) = u1 + Σ_{i=2}^{d+1} αi ui.    (5.45)

Then consider the current iterate x(k); because the Power Method generates iterates by successive multiplication by A, we can write x(k) as

    x(k) = A^d x(k−d) = A^d (u1 + Σ_{i=2}^{d+1} αi ui) = u1 + Σ_{i=2}^{d+1} αi λi^d ui.    (5.46)

But since λi is c·di, where di is a dth root of unity,

    x(k) = u1 + c^d Σ_{i=2}^{d+1} αi ui.    (5.47)


function x∗ = PowerExtrapolation(x(k−d), x(k)) {
    x∗ = (x(k) − c^d x(k−d)) (1 − c^d)^−1;
}

Algorithm 8: Power Extrapolation

function x(n) = ExtrapolatedPowerMethod(d) {
    x(0) = v;
    k = 1;
    repeat
        x(k) = A x(k−1);
        δ = ||x(k) − x(k−1)||1;
        if k == d + 2, x(k) = PowerExtrapolation(x(k−d), x(k));
        k = k + 1;
    until δ < ε;
}

Algorithm 9: Power Method with Power Extrapolation

This allows us to solve for u1 in closed form:

    u1 = (x(k) − c^d x(k−d)) / (1 − c^d).    (5.48)

For instance, for d = 4, the assumption made is that the nonprincipal eigenvalues of modulus c are given by c, −c, ci, and −ci (i.e., the 4th roots of unity). A graph in which the leaf nodes in the SCC graph contain only cycles of length l, where l is any divisor of d = 4, has exactly this property. Algorithms 8 and 9 show how to use Ad Extrapolation in conjunction with the Power Method. Note that Power Extrapolation with d = 1 is just Simple Extrapolation.

The overhead in performing the extrapolation shown in Algorithm 8 comes from computing the linear combination (x(k) − c^d x(k−d))(1 − c^d)^−1, an O(n) computation. In our experimental setup, the overhead of a single application of Power Extrapolation is 1% of the cost of a standard power iteration. Furthermore, Power Extrapolation needs to be applied only once to achieve the full benefit.
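A minimal Python/NumPy sketch of Algorithms 8 and 9 follows; the tolerance, the choice d = 6, and the bookkeeping of stored iterates are our own illustrative assumptions:

    import numpy as np

    def power_extrapolation(x_k, x_kmd, c, d):
        """Power Extrapolation (Algorithm 8), closed form (5.48)."""
        return (x_k - c**d * x_kmd) / (1 - c**d)

    def extrapolated_power_method(A, v, c, d=6, tol=1e-8, max_iter=500):
        """Power Method with a single Power Extrapolation step at k = d + 2
        (cf. Algorithm 9)."""
        iterates = [v.copy()]
        x = v.copy()
        for k in range(1, max_iter + 1):
            x = A @ x
            iterates.append(x.copy())
            delta = np.abs(iterates[-1] - iterates[-2]).sum()
            if k == d + 2:
                x = power_extrapolation(x, iterates[k - d], c, d)
                iterates[-1] = x.copy()       # continue iterating from the extrapolated vector
            if delta < tol:
                break
        return x, k

Because the step is applied only once and costs a single vector combination, its overhead is negligible compared with a power iteration, which is what makes it attractive relative to Quadratic Extrapolation.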

5.4.3.2 Results and Discussion In our experiments, Ad Extrapolation performs the best for d = 6. Figure 5.10 plots the convergence of Ad Extrapolation for d ∈ {1, 2, 4, 6, 8}, as well as of the standard Power Method, for c = 0.85 and c = 0.90. The wallclock speedups, compared with the standard Power Method, for these 5 values of d for c = 0.85 are given in Table 5.1. For comparison, Figure 5.11 compares the convergence of the Quadratic Extrapolated PageRank with A6 Extrapolated PageRank. Note that the speedup in convergence is similar; however, A6 Extrapolation is much simpler to implement, and

Figure 5.10 Convergence rates for Ad Extrapolation, for d ∈ {1, 2, 4, 6, 8}, compared with standard Power Method. (Two panels, c = 0.85 and c = 0.90; L1 residual versus number of iterations.)

Figure 5.11 Comparison of convergence rates for Power Method, A6 Extrapolation, and Quadratic Extrapolation on LargeWeb for c = 0.85.

Table 5.1 Wallclock speedups for Ad Extrapolation, for d ∈ {1, 2, 4, 6, 8}, and Quadratic Extrapolation.

    Type         Speedup
    d = 1        −28%
    d = 2         18%
    d = 4         26%
    d = 6         30%
    d = 8         22%
    Quadratic     21%

Figure 5.12 Comparison of the L1 residual versus KDist and Kmin for PageRank iterates. Note that the two curves nearly perfectly overlap, suggesting that in the case of PageRank, the easily calculated L1 residual is a good measure for the convergence of query-result rankings.

has negligible overhead, so that its wallclock-speedup is higher. In particular, each application of Quadratic Extrapolation requires about half of the cost of a standard iteration, and must be applied several times to achieve maximum benefit.

5.5 MEASURES OF CONVERGENCE

In this section, we present empirical results demonstrating the suitability of the L1 residual, even in the context of measuring convergence of induced document rankings. In measuring the convergence of the PageRank vector, prior work has usually relied on δk = ||Ax(k) − x(k)||p, the Lp norm of the residual vector, for p = 1 or p = 2, as an indicator of convergence [7, 56]. Given the intended application, we might expect that a better measure of convergence is the distance, using an appropriate measure of distance, between the rank orders for query results induced by Ax(k) and x(k). We use two measures of distance for rank orders, both based on the Kendall's-τ rank correlation measure [47]: the KDist measure, defined below, and the Kmin measure, introduced by Fagin et al. in [27]. To determine whether the residual is a "good" measure of convergence, we compared it to the KDist and Kmin of rankings generated by Ax(k) and x(k). We show empirically that in the case of PageRank computations, the L1 residual δk is closely correlated with the KDist and Kmin distances between query results generated using the values in Ax(k) and x(k).

We define the distance measure KDist as follows. Consider two partially ordered lists of URLs, τ1 and τ2, each of length k. Let U be the union of the URLs in τ1


and τ2. If ρ1 is U − τ1, then let τ1′ be the extension of τ1, where τ1′ contains ρ1 appearing after all the URLs in τ1 (the URLs in ρ1 are placed with the same ordinal rank at the end of τ1). We extend τ2 analogously to yield τ2′. KDist is then defined as

    KDist(τ1, τ2) = |{(u, v) : τ1′, τ2′ disagree on order of (u, v), u ≠ v}| / (|U|(|U| − 1)).    (5.49)

In other words, KDist(τ1, τ2) is the probability that τ1′ and τ2′ disagree on the relative ordering of a randomly selected pair of distinct nodes (u, v) ∈ U × U (a pair ordered in one list and tied in the other is considered a disagreement).

To measure the convergence of PageRank iterations in terms of induced rank orders, we measured the KDist distance between the induced rankings for the top 100 results, averaged across 27 test queries, using successive power iterates for the LargeWeb dataset, with the damping factor c set to 0.9. (Computing Kendall's τ over the complete ordering of all of LargeWeb is expensive; instead we opt to compute KDist and Kmin over query results. The argument can also be made that it is more sensible to compute these measures over a randomly chosen subset of the web rather than the whole web, since small changes in ordering over low-ranked pages rarely indicate an important change.) The average residuals using the KDist, Kmin, and L1 measures are plotted in Figure 5.12; the L1 residual δk is normalized so that δ0 is 1. Surprisingly, the L1 residual is almost perfectly correlated with KDist, and is closely correlated with Kmin. (We emphasize that we have shown close agreement between L1 and KDist for measuring residuals, not for distances between arbitrary vectors.) A rigorous explanation for the close match between the L1 residual and the Kendall's τ-based residuals is an interesting avenue of future investigation.
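As a small, self-contained illustration of equation (5.49), the following Python sketch (function name and example lists are ours) compares two partial rankings, extending each with the other's missing URLs as tied items at the end:

    from itertools import combinations

    def kdist(tau1, tau2):
        """KDist of equation (5.49): fraction of distinct pairs from U = tau1 ∪ tau2
        on whose relative order the extended rankings disagree."""
        U = set(tau1) | set(tau2)
        pos1 = {u: i for i, u in enumerate(tau1)}
        pos2 = {u: i for i, u in enumerate(tau2)}
        end1, end2 = len(tau1), len(tau2)     # shared ordinal rank for missing items
        disagree = 0
        for u, v in combinations(U, 2):
            a = pos1.get(u, end1) - pos1.get(v, end1)
            b = pos2.get(u, end2) - pos2.get(v, end2)
            # a pair ordered in one list and tied in the other counts as a disagreement
            if (a < 0 and b >= 0) or (a > 0 and b <= 0) or (a == 0 and b != 0):
                disagree += 1
        return 2 * disagree / (len(U) * (len(U) - 1))

    print(kdist(["a", "b", "c", "d"], ["a", "c", "b", "e"]))

Counting unordered pairs and doubling is equivalent to the ordered-pair count in (5.49), since a disagreement on (u, v) is also a disagreement on (v, u).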

Chapter Six Adaptive PageRank

6.1 INTRODUCTION In the previous chapter, we exploited certain properties about the matrix A (namely, its known eigenvalues) to accelerate PageRank. In this chapter, we exploit certain properties of the convergence of the Power Method on the Web matrix to accelerate PageRank. Namely, we make the following simple observation: the convergence rates of the PageRank values of individual pages during application of the Power Method are nonuniform.1 That is, many pages converge quickly, with a few pages taking much longer. Furthermore, the pages that converge slowly are generally those pages with high PageRank. We devise a simple algorithm that exploits this observation to speed up the computation of PageRank, called Adaptive PageRank. In this algorithm, the PageRank of pages that have converged is not recomputed at each iteration after convergence. In large-scale empirical studies, this algorithm speeds up the computation of PageRank by 17%.

6.2 DISTRIBUTION OF CONVERGENCE RATES

Table 6.1 and Figure 6.1 show convergence statistics for the pages in the Stanford.EDU dataset. We say that the PageRank xi of page i has converged when

    |xi(k+1) − xi(k)| / |xi(k)| < 10^−3.

We choose this as the measure of convergence of an individual page rather than |xi(k+1) − xi(k)| to suggest that each individual page must be accurate relative to its own PageRank. (The rank of each individual page i corresponds to the individual component xi(k) of the current iterate x(k) of the Power Method.) We still use the L1 norm to detect convergence of the entire PageRank vector for the reasons that were discussed in the previous chapter.

Table 6.1 shows the number of pages and average PageRanks of those pages that converge in fewer than 15 iterations, and those pages that converge in more than 15 iterations. Notice that most pages converge in fewer than 15 iterations, and their average PageRank is one order of magnitude lower than that for those pages that converge in more than 15 iterations.

Figure 6.1(a) shows the profile of the bar graph where each bar represents a page i and the height of the bar is the convergence time ti of that page i. The pages


Table 6.1 Statistics about pages in the Stanford.EDU dataset whose convergence times are quick (ti ≤ 15) and pages whose convergence times are long (ti > 15).

                Number of Pages    Average PageRank
    ti ≤ 15     227597             2.6642e-06
    ti > 15     54306              7.2487e-06
    Total       281903             3.5473e-06

Figure 6.1 Experiments on Stanford.EDU dataset. (a) Profile of bar graph where each bar represents a page i, and its height represents its convergence time ti. (b) Bar graph where the x-axis represents the discrete convergence time t, and the height of ti represents the number of pages that have convergence time t. (c) Bar graph where the height of each bar represents the average PageRank of the pages that converge in a given convergence time. (d) Bar graph where the height of each bar represents the average PageRank of the pages that converge in a given interval.

are sorted from left to right in order of convergence times. Notice that most pages converge in under 15 iterations, but some pages require more than 40 iterations to converge. Figure 6.1(b) shows a bar graph where the height of each bar represents the number of pages that converge at a given convergence time. Again, notice that

Figure 6.2 Experiments on the LargeWeb dataset. (a) Bar graph where the x-axis represents the convergence time t in number of iterations, and the height of bar ti represents the proportion of pages that have convergence time t. (b) Cumulative plot of convergence times. The x-axis gives the time t in number of iterations, and the y-axis gives the proportion of pages that have a convergence time ≤ t.

most pages converge in under 15 iterations, but there are some pages that take over 40 iterations to converge. Figure 6.1(c) shows a bar graph where the height of each bar represents the average PageRank of the pages that converge in a given convergence time. Notice that those pages that converge in fewer than 15 iterations generally have a lower PageRank than pages that converge in more than 50 iterations. This is illustrated in Figure 6.1(d) as well, where the height of each bar represents the average PageRank of those pages that converge within a certain interval (i.e., the bar labeled "7" represents the pages that converge in anywhere from 1 to 7 iterations, and the bar labeled "42" represents the pages that converge in anywhere from 36 to 42 iterations).

Figures 6.2 and 6.3 show some statistics for the LargeWeb dataset. Figure 6.2(a) shows the proportion of pages whose ranks converge to a relative tolerance of .001 in each iteration. Figure 6.2(b) shows the cumulative version of the same data; that is, it shows the percentage of pages that have converged up through a particular iteration. We see that in 16 iterations, the ranks for more than two-thirds of pages have converged. Figure 6.3 shows the average PageRanks of pages that converge in various iterations. Notice that those pages that are slow to converge tend to have higher PageRank.

6.3 ADAPTIVE PAGERANK ALGORITHM The skewed distribution of convergence times shown in the previous section suggests that the running time of the PageRank algorithm can be significantly reduced by not recomputing the PageRanks of the pages that have already converged. This is the basic idea behind the Adaptive PageRank algorithm that we motivate and present in this section.

Figure 6.3 Average PageRank versus convergence time (in number of iterations) for the LargeWeb dataset. Note that pages that are slower to converge to a relative tolerance of .001 tend to have high PageRank.

6.3.1 Algorithm Intuition

We begin by describing the intuition behind the Adaptive PageRank algorithm. We consider next a single iteration of the Power Method, and show how we can reduce the cost. Consider that we have completed k iterations of the Power Method. Using the iterate x(k), we now wish to generate the iterate x(k+1). Let C be the set of pages that have converged to a given tolerance, and N be the set of pages that have not yet converged. We can split the matrix A defined in Section 2.2 into two submatrices. Let AN be the m × n submatrix corresponding to the inlinks of those m pages whose PageRanks have not yet converged, and let AC be the (n − m) × n submatrix corresponding to the inlinks of those pages that have already converged. Let us likewise split the current iterate of the PageRank vector x(k) into the m-vector xN(k) corresponding to the components of x(k) that have not yet converged, and the (n − m)-vector xC(k) corresponding to the components of x(k) that have already converged. We may order A and x(k) as follows:

    x(k) = [ xN(k) ; xC(k) ]    (6.1)

and

    A = [ AN ; AC ],    (6.2)

where the semicolon denotes vertical (row-wise) stacking.


function adaptivePR(G, x(0), v) {
    repeat
        xN(k+1) = AN x(k);
        xC(k+1) = xC(k);
        [N, C] = detectConverged(x(k), x(k+1), ε);
        periodically, δ = ||A x(k) − x(k)||1;
    until δ < ε;
    return x(k+1);
}

Algorithm 10: Adaptive PageRank

We may now write the next iteration of the Power Method as

[ x_N^(k+1) ; x_C^(k+1) ] = [ A_N ; A_C ] · x^(k).

However, since the elements of x_C^(k) have already converged, we do not need to recompute x_C^(k+1). Therefore, we may simplify each iteration of the computation to be

x_N^(k+1) = A_N x^(k),    (6.3)
x_C^(k+1) = x_C^(k).    (6.4)

The basic Adaptive PageRank algorithm is given in Algorithm 10. Identifying pages in each iteration that have converged is inexpensive. However, reordering the matrix A at each iteration is expensive. Therefore, we exploit the idea given above by periodically identifying converged pages and constructing A_N. Since A_N is smaller than A, the iteration cost for future iterations is reduced. We describe the details of the algorithm in the next section.

6.3.2 Filter-based Adaptive PageRank

Since the web matrix A is several gigabytes in size, forming the submatrix A_N needed in (6.3) will not be practical to do too often. Furthermore, if the entire matrix cannot be stored in memory, there is in general no efficient way to simply "ignore" the unnecessary entries (e.g., edges pointing to converged pages) in A if they are scattered throughout A. This is because the time spent reading the portions of the matrix to and from disk is generally much longer than the floating point operations saved. We describe in this section how to construct A_N efficiently and periodically.

Consider the following reformulation of the algorithm that was described in the previous section. Consider the matrix A as described in (6.2). Note that the submatrix A_C is never actually used in computing x^(k+1). Let us define the matrix A′ as

A′ = [ A_N ; 0 ],    (6.5)


where we have replaced A_C with an all-zero matrix of the same dimensions as A_C. Similarly, let us define x′_C^(k) as

x′_C^(k) = [ 0 ; x_C^(k) ].    (6.6)

Now note that we can express an iteration of Adaptive PageRank as

x^(k+1) = A′ x^(k) + x′_C^(k).    (6.7)

Since A′ has the same dimensions as A, it seems we have not reduced the iteration cost; however, note that the cost of the matrix-vector multiplication is essentially given by the number of nonzero entries in the matrix, not the matrix dimensions.2

The above reformulation gives rise to the filter-based Adaptive PageRank scheme: if we can periodically increase the sparsity of the matrix A, we can lower the average iteration cost. Consider the set of indices C of pages that have been identified as having converged. We define the matrix A′′ as

A′′_ij = { 0     if i ∈ C,
           A_ij  otherwise.    (6.8)

In other words, when constructing A′′, we replace the row i in A with zeros if i ∈ C. Similarly, define x′′_C as

(x′′_C)_i = { (x^(k))_i  if i ∈ C,
              0          otherwise.    (6.9)

Note that A′′ is much sparser than A, so that the cost of the multiplication A′′x is much less than the cost of the multiplication Ax. In fact, the cost is the same as if we had an ordered matrix and performed the multiplication A_N x. Now note that

x^(k+1) = A′′ x^(k) + x′′_C^(k)    (6.10)

represents an iteration of the Adaptive PageRank algorithm. No special ordering of page identifiers was assumed.

The key parameter in the above algorithm is how often to identify converged pages and construct the "compacted" matrix A′′; since the cost of constructing A′′ from A is on the order of the cost of the multiply A′′x, we do not want to apply it too often. However, looking at the convergence statistics given in Section 6.2, it is clear that even periodically filtering out the "converged edges" from A will be effective in reducing the cost of future iterations. The iteration cost will be lowered for three reasons:

1. reduced i/o for reading in the link structure;
2. fewer memory accesses when executing Algorithm 1;
3. fewer flops when executing Algorithm 1.

2 More precisely, since the multiplication A′′x is performed using Algorithm 1 with the matrix P and the vector v, the number of nonzero entries in P determines the iteration cost. Note that subsequently, when we discuss zeroing out rows of A, this corresponds implementationally to zeroing out rows of the sparse matrix P.


function filterAPR(G, x^(0), v) {
    repeat
        x^(k+1) = A′′ x^(k) + x′′_C^(k);
        periodically,
            [N, C] = detectConverged(x^(k), x^(k+1), ε);
            [A′′] = filter(A′′, N, C);
        periodically, δ = ||A x^(k) − x^(k)||_1;
    until δ < ε;
    return x^(k+1);
}
Algorithm 11: Filter-based Adaptive PageRank
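To make the filtering idea concrete, the following is a minimal Python/SciPy sketch of the filter-based iteration. It is not the book's implementation: for simplicity the matrix A is treated here as the complete iteration matrix (so that one ordinary power step is x^(k+1) = A x^(k), rather than applying the sparse link matrix P and the teleportation term separately as in Algorithm 1), and the helper names, filtering period, and relative-tolerance test are assumptions made for the example.

import numpy as np
import scipy.sparse as sp

def filter_rows(A, rows):
    # Return a copy of A with the given rows zeroed out (the "compacted" matrix A'').
    A = sp.lil_matrix(A)
    A[rows, :] = 0
    return A.tocsr()

def filter_based_adaptive_pagerank(A, x0, eps=1e-4, conv_tol=1e-3, period=8, max_iter=200):
    # Sketch of filter-based Adaptive PageRank (Algorithm 11), under the assumptions above.
    x = np.asarray(x0, dtype=float)
    A2 = sp.csr_matrix(A)              # A'': rows of converged pages zeroed out
    x_conv = np.zeros_like(x)          # x''_C: frozen values of converged pages
    converged = np.zeros(len(x), dtype=bool)
    for k in range(max_iter):
        x_new = A2 @ x + x_conv        # equation (6.10)
        if (k + 1) % period == 0:      # periodically detect converged pages and re-filter
            newly = (~converged) & (np.abs(x_new - x) < conv_tol * np.abs(x_new))
            if newly.any():
                converged |= newly
                x_conv[newly] = x_new[newly]
                A2 = filter_rows(A2, np.where(newly)[0])
            if np.abs(A @ x - x).sum() < eps:   # full residual, checked only periodically
                return x_new
        x = x_new
    return x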

6.4 EXPERIMENTAL RESULTS

In our experiments, we found that better speedups were attained when we ran Algorithm 11 in phases, where in each phase we begin with the original unfiltered version of the link structure, iterate a certain number of times (in our case 8), filter the link structure, and iterate some additional number of times (again 8). In successive phases, we reduce the tolerance threshold. In each phase, filtering using this threshold is applied once, after the 8th iteration. This strategy tries to keep all pages at roughly the same level of error while computing successive iterates to achieve some specified final tolerance.

A comparison of the total cost of the standard PageRank algorithm and the Adaptive PageRank algorithm follows. Figure 6.4(a) depicts the total number of floating point operations needed to compute the PageRank vector to an L1 residual threshold of 10^−3 and 10^−4 using the Power Method and the Adaptive Power Method. The Adaptive algorithm operated in phases as described above using 10^−2, 10^−3, and 10^−4 as the successive tolerances. Note that the adaptive method decreases the number of FLOPS needed by 22.6% and 23.6% in reaching final L1 residuals of 10^−3 and 10^−4, respectively. Figure 6.4(b) depicts the total wallclock time needed for the same scenarios. The adaptive method reduces the wallclock time needed to compute the PageRank vectors by 14% and 15.2% in reaching final L1 residuals of 10^−3 and 10^−4, respectively. Note that the adaptive method took 2 and 3 more iterations for reaching 10^−3 and 10^−4, respectively, than the standard Power Method, as shown in Figure 6.4(c); however, as the average iteration cost was lower, speedup still occurred.

6.5 EXTENSIONS

In this section, we present several extensions to the basic Adaptive PageRank algorithm that should further speed up the computation of PageRank.

6.5.1 Further Reducing Redundant Computation

The core of the Adaptive PageRank algorithm is in replacing the matrix multiplication A x^(k) with (6.3) and (6.4), reducing redundant computation by not recomputing the PageRanks of the pages in C (those pages that have converged).


Figure 6.4 Experiments on LargeWeb dataset depicting total cost for computing the PageRank vector to an L1 residual threshold of 10−3 and 10−4 . (a) FLOPS; (b) wallclock time; (c) number of iterations.

In this section, we show how to further reduce redundant computation by not recomputing the components of the PageRanks of those pages in N due to links from those pages in C. More specifically, we can write the matrix A in (6.2) as

A = [ A_NN  A_NC ; A_CN  A_CC ],

where A_NN are the links from pages that have not converged to pages that have not converged, A_CN are links from pages that have converged to pages that have not converged, and so on. We may now rewrite (6.3) as follows:

x_N^(k+1) = A_NN x_N^(k) + A_CN x_C^(k).

Since x_C does not change at each iteration, the component A_CN x_C^(k) does not change at each iteration. Therefore, we only need to recompute A_CN x_C^(k) each time the matrix A is reordered. An algorithm based on this idea may proceed as in Algorithm 12. We expect that this algorithm should speed up the computation of PageRank even further than the speedups due to the standard Adaptive PageRank algorithm, but we have not investigated this empirically.


function modifiedAPR(G, x^(0), v) {
    repeat
        periodically,
            [N, C] = detectConverged(x^(k), x^(k+1), ε);
            y = A_CN x_C^(k);
        x_N^(k+1) = A_NN x_N^(k) + y;
        x_C^(k+1) = x_C^(k);
        periodically, δ = ||A x^(k) − x^(k)||_1;
    until δ < ε;
    return x^(k+1);
}
Algorithm 12: Modified Adaptive PageRank
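The distinguishing step of Algorithm 12 is that the product A_CN x_C stays fixed between matrix reorderings, so it can be cached rather than recomputed every iteration. A hedged sketch of a single iteration follows; the function name, matrix layout, and calling convention are assumptions made for illustration.

import scipy.sparse as sp

def modified_apr_step(A_NN, A_CN, x_N, x_C, y=None):
    # One iteration of the modified scheme: y = A_CN @ x_C is computed only
    # when the matrix is (re)partitioned and reused afterwards.
    if y is None:                  # recompute only after a reorder
        y = A_CN @ x_C
    x_N_next = A_NN @ x_N + y      # x_N^(k+1) = A_NN x_N^(k) + A_CN x_C^(k)
    return x_N_next, y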

6.5.2 Using the Matrix Ordering from the Previous Computation

PageRank is computed several times per month, and it is likely that the convergence times for a given page will not change dramatically from one computation to the next. This suggests that the matrix A should be initially ordered by convergence time in the previous PageRank computation. That is, the page corresponding to i = 1 should be the page that took the longest to converge during the last PageRank computation, and the page corresponding to i = n should be the page that converged most quickly during the last PageRank computation. This should minimize the cost incurred by matrix reorders.

6.6 DISCUSSION

In this chapter, we presented two contributions. First, we showed that most pages on the Web converge to their true PageRank quickly, while relatively few pages take much longer. We further showed that those slow-converging pages generally have high PageRank, and the pages that converge quickly generally have low PageRank. Second, we developed an algorithm, called Adaptive PageRank, that exploits this observation to speed up the computation of PageRank by 17% by avoiding redundant computation. Finally, we discussed several extensions of Adaptive PageRank that should further speed up PageRank computations. The next chapter presents an algorithm called BlockRank, which exploits yet another property of the Web, its intrahost versus interhost linkage patterns.

Chapter Seven

BlockRank

In this chapter, we exploit yet another observation about the Web matrix A to speed up PageRank. We observe that the Web link graph has a nested block structure; that is, the vast majority of hyperlinks link pages on a host to other pages on the same host, and many of those that do not link pages within the same domain. We show how to exploit this structure to speed up the computation of PageRank by BlockRank, a 3-stage algorithm whereby (1) the local PageRanks of pages for each host are computed independently using the link structure of that host, (2) these local PageRanks are then weighted by the "importance" of the corresponding host (calculated by running PageRank at the host-level), and (3) the standard PageRank algorithm is then run using as its starting vector the weighted concatenation of the local PageRanks. We begin the chapter by exploring the block structure of the Web.

7.1 BLOCK STRUCTURE OF THE WEB

The key terminology we use here is given in Table 7.1. To investigate the structure of the Web, we run the following simple experiment. We take all the hyperlinks in Full-LargeWeb, and count how many of these are "intrahost" links (links from a page to another page in the same host) and how many are "interhost" links (links from a page to a page in a different host). Table 7.2 shows that 79.1% of the links in this dataset are intrahost links and 20.9% are interhost links. These intrahost connectivity statistics are not far from the earlier results of Bharat et al. [11]. We also investigate the number of links that are intradomain links, and the number of links that are interdomain links. Notice in Table 7.2 that an even larger majority of links are intradomain links (83.9%).

Table 7.1 Example illustrating our terminology using the sample url http://cs.stanford.edu/research/.

Term               Example: cs.stanford.edu/research/
top level domain   edu
domain             stanford.edu
hostname           cs
host               cs.stanford.edu
path               /research/


Table 7.2 Hyperlink statistics on LargeWeb for the full graph (Full: 291M nodes, 1.137B links) and for the graph with dangling nodes removed (DNR: 64.7M nodes, 607M links).

                   Domain                        Host
Full   Intra       953M links   83.9%            899M links   79.1%
       Inter       183M links   16.1%            237M links   20.9%
DNR    Intra       578M links   95.2%            568M links   93.6%
       Inter        29M links    4.8%             39M links    6.4%

These results make sense intuitively. Take as an example the domain cs.stanford.edu. Most of the links in cs.stanford.edu are links around the cs.stanford.edu site (such as cs.stanford.edu/admissions or cs.stanford.edu/research). Furthermore, a large percentage of non-navigational links are links to other Stanford hosts, such as www.stanford.edu, library.stanford.edu, or www-cs-students.stanford.edu. One might expect that this structure exists even in lower levels of the Web hierarchy. For example, one might expect that pages under cs.stanford.edu/admissions/ are highly interconnected, and very loosely connected with pages in other sublevels, leading to a nested block structure.

This type of nested block structure can be naturally exposed by sorting the link graph to construct a link matrix in the following way. We sort urls lexicographically, except that as the sort key we reverse the components of the domain. For instance, the sort key for the url www.stanford.edu/home/students/ would be edu.stanford.www/home/students. (A small sketch of this sort-key construction appears after the list below.) The urls are then assigned sequential identifiers when constructing the link matrix. A link matrix contains as its (i, j)th entry 1 if there is a link from i to j, and 0 otherwise. This arrangement has the desired property that urls are grouped in turn by top level domain, domain, hostname, and finally path. The subgraph for pages in stanford.edu appears as a sub-block of the full link matrix. In turn, the subgraph for pages in www-db.stanford.edu appears as a nested sub-block.

We can then gain insight into the structure of the Web by using dotplots to visualize the link matrix. In a dotplot, if there exists a link from page i to page j then point (i, j) is colored; otherwise point (i, j) is white. Since our full datasets are too large to see individual pixels, we show several slices of the web in Figure 7.1. Notice three things:

1. There is a definite block structure to the Web.
2. The individual blocks are much smaller than the entire Web.
3. There are clear nested blocks corresponding to domains, hosts, and subdirectories.
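As an illustration of the reversed-domain sort key described above, here is a minimal Python sketch. The function name and the handling of urls without a scheme are assumptions made for the example; the book does not prescribe a particular implementation.

from urllib.parse import urlsplit

def blockrank_sort_key(url: str) -> str:
    # Build the lexicographic sort key: hostname components reversed, path kept as-is.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    reversed_host = ".".join(reversed(host.split(".")))   # www.stanford.edu -> edu.stanford.www
    return reversed_host + parts.path

urls = [
    "www.stanford.edu/home/students/",
    "cs.stanford.edu/research/",
    "www.berkeley.edu/",
]
# Sorting by this key groups urls by top level domain, then domain, then host, then path.
for u in sorted(urls, key=blockrank_sort_key):
    print(blockrank_sort_key(u), "<-", u)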


Figure 7.1 A view of four different slices of the Web: (a) the IBM domain; (b) all the hosts in the Stanford and Berkeley domains; (c) the first 50 Stanford domains, alphabetically; and (d) the host-graph of the Stanford and Berkeley domains.

Figure 7.1(a) shows the dotplot for the ibm.com domain. Notice that there are clear blocks, which correspond to different hosts within ibm.com; for example, the upper left block corresponds to the almaden.ibm.com hosts (the hosts for IBM's Almaden Research Center). Notice that the pages at the very end of the plot (pages i ≥ 18544) are heavily inlinked (the vertical line at the lower right-hand corner of the plot). These are the pages within the www.ibm.com host, and it is expected that they should be heavily inlinked. Also notice that there are 4 patterns that look like the upside-down letter "L." These are sites that have a shallow hierarchy; the root node links to most pages in the host (horizontal line), and is linked to by most pages in the host (vertical line), but there is not much interlinkage within the site itself (empty block). Finally, notice that the area around the diagonal is very dense; this corresponds to strong intrablock linkage, especially in the smaller blocks.

Figure 7.1(b) shows the dotplot for Stanford/Berkeley. Notice that this also has a strong block structure and a dense diagonal. Furthermore, this plot makes clear the nested block structure of the web. The superblock on the upper left-hand side is the stanford.edu domain, and the superblock on the lower right-hand side is the berkeley.edu domain.


Figure 7.1(c) shows a closeup of the first 50 hosts alphabetically within the stanford.edu domain. The majority of this dotplot is composed of three large hosts: acomp.stanford.edu, the academic computing site at Stanford, in the upper left-hand corner; cmgm.stanford.edu, an online bioinformatics resource, in the middle; and daily.stanford.edu, the Web site for the Stanford Daily (Stanford's student newspaper) in the lower right-hand corner. There are many interesting structural motifs in this plot. First there is a long vertical line in the upper left-hand corner. This feature corresponds to the web site http://acomp.stanford.edu; most pages in the acomp.stanford.edu host point to this root node. Also, there is a clear nested block structure within acomp.stanford.edu on the level of different directories in the url hierarchy.

In the Stanford Daily site, we see diagonal lines, long vertical blocks, a main center block, and short thin blocks. The first several web pages in daily.stanford.edu represent the front page of the paper for the past several days. Each front page links to the front page of the day before, and therefore there is a small diagonal line in the upper left-hand corner of the Stanford Daily block. The diagonals are due to the url naming convention of the Stanford Daily, which causes the lexicographic ordering of urls to induce a chronological ordering of the articles. The front pages link to articles, which are the middle pages of the block. Therefore, we see a horizontal strip in the top middle. These articles also link back to the front pages, and so we see a vertical strip on the middle left-hand side. The articles link to each other, since each article links to related articles and articles by the same author. This accounts for the square block in the center. The long vertical strips represent pages on the standard menu on each page of the site (such as the "subscriptions" page, the "write a letter to the editor" page, and the "advertising" page). Finally, the diagonal lines that surround the middle block are pages such as "e-mail this article to a friend" or "comment on this article," which are linked to only one article each.

Figure 7.1(d) shows the host graph for the stanford.edu and berkeley.edu domains, in which each host is treated as a single node, and an edge is placed between host i and host j if there is a link between any page in host i and host j. Again, we see strong block structure on the domain level, and the dense diagonal shows strong block structure on the host level as well. The vertical and horizontal lines near the bottom right-hand edge of the Stanford and Berkeley domains represent the www.stanford.edu and www.berkeley.edu hosts, showing that these hosts are, as expected, strongly linked to most other hosts within their own domain.

7.1.1 Block Sizes

We investigate here the sizes of the hosts on the Web. Figure 7.2(a) shows the distribution over number of (crawled) pages of the hosts in LargeWeb. Notice that the majority of sites are small, on the order of 100 pages. Figure 7.2(b) shows the sizes of the host-blocks in the Web when dangling nodes are removed. When dangling nodes are removed, the blocks become smaller, and the distribution is still skewed toward small blocks. The largest block had 6,000 pages. In future sections we show how to exploit the small sizes of the blocks, relative to the dataset as a whole, to speed up link analysis.


Figure 7.2 Histogram of distribution over host sizes of the Web. The x-axis gives bucket sizes for the log10 of the size of the host-blocks, and the y-axis gives the fraction of host-blocks that are that size.

7.1.2 The GeoCities Effect

While one would expect that most domains have high intracluster link density, as in cs.stanford.edu, there are some domains that one would expect to have low intracluster link density, for example pages.yahoo.com (formerly www.geocities.com). The Web site http://pages.yahoo.com is the root page for Yahoo! GeoCities, a free Web hosting service. There is no reason to think that people who have Web sites on GeoCities would prefer to link to one another rather than to sites not in GeoCities.1

Figure 7.3 shows a histogram of the intradomain densities of the Web. In Figure 7.3(a) there is a spike near 0% intrahost linkage, showing that many hosts are not very intraconnected. However, when we remove the hosts that have only 1 page (Figure 7.3(b)), this spike is substantially dampened, and when we exclude hosts with fewer than 5 pages, the spike is eliminated. This shows that the hosts in LargeWeb that are not highly intraconnected are very small in size. When the very small hosts are removed, the great majority of hosts have high intrahost densities.

It should be noted that this test does not exclude the possibility of nested subhosts. For example, each user in GeoCities may have a highly intralinked area within GeoCities. In fact, this is observed to be the case with most freehosts. Thus, algorithms for subdividing hosts based on url path and intra-level density would be useful for the BlockRank algorithm. However, this is outside the scope of this book.

1 There may of course be deeper structure found in the path component, although we currently do not directly exploit such structure.

7.2 BLOCKRANK ALGORITHM

We now present the BlockRank algorithm that exploits the empirical findings of the previous section to speed up the computation of PageRank.


Figure 7.3 Distribution over interconnectivity of host-blocks for the DNR-LargeWeb dataset. The x-axis shows percentile buckets for intrahost linkage density (the percent of edges originating or terminating in a given host that are intrahost links), and the y-axis shows the fraction of hosts that have that linkage density. (a) for all hosts; (b) for all hosts that have more than 1 page; (c) for all hosts that have 5 or more pages.

This work is motivated by and builds on aggregation/disaggregation techniques [58, 65] and domain decomposition techniques [30] in numerical linear algebra. Steps 2 and 3 of the BlockRank algorithm are similar to the Rayleigh-Ritz refinement technique [51].

7.2.1 Overview of BlockRank Algorithm

The block structure of the Web suggests a fast algorithm for computing PageRank, wherein a "local PageRank vector" is computed for each host, giving the relative importance of pages within a host. These local PageRank vectors can then be used to form an approximation to the global PageRank vector that is used as a starting vector for the standard PageRank computation. This is the basic idea behind the BlockRank algorithm, which we summarize here and describe in this section. The algorithm proceeds as follows:

0. Split the Web into blocks by domain.
1. Compute the local PageRanks for each block (Section 7.2.2).


2. Estimate the relative importance, or "BlockRank," of each block (Section 7.2.3).
3. Weight the local PageRanks in each block by the BlockRank of that block, and concatenate the weighted local PageRanks to form an approximate global PageRank vector z (Section 7.2.4).
4. Use z as a starting vector for standard PageRank (Section 7.2.5).

We describe the steps in detail below, and we introduce some notation here. We use lower-case indices (i, j) to represent indices of individual Web sites, and upper-case indices (I, J) to represent indices of blocks. We use the shorthand notation i ∈ I to denote that page i ∈ block I. The number of elements in block J is denoted n_J. The graph of a given block J is given by the n_J × n_J submatrix G_JJ of the matrix G.

7.2.2 Computing Local PageRanks

In this section, we describe computing a "local PageRank vector" for each block in the Web. Since most blocks have very few links in and out of the block as compared to the number of links within the block, it seems plausible that the relative rankings of most of the pages within a block are determined by the intra-block links.

We define the local PageRank vector l_J of a block J (G_JJ) to be the result of the PageRank algorithm applied only on block J, as if block J represented the entire Web, and as if the links to pages in other blocks did not exist. That is,

l_J = PageRank(G_JJ, s_J, v_J),

where the start vector s_J is the n_J × 1 uniform probability vector over pages in block J ([1/n_J]_{n_J×1}), and the personalization vector v_J is the n_J × 1 vector whose elements are all zero except the element corresponding to the root node of block J, whose value is 1. (A uniform vector may also be used as the personalization vector for local PageRank computations. It does not affect the final computed PageRank vector, and the difference in computation time is negligible.)

Local PageRank accuracies

To investigate how well these local PageRank vectors approximate the relative magnitudes of the true PageRank vectors within a given host, we run the following experiment. We compute the local PageRank vectors l_J of each host in Stanford/Berkeley. We also compute the global PageRank vector x for Stanford/Berkeley using the standard PageRank algorithm whose personalization vector v is a uniform distribution over root nodes. We then compare the local PageRank scores of the pages within a given host to the global PageRank scores of the pages in the same host. Specifically, we take the elements corresponding to the pages in host J of the global PageRank vector x, and form the vector g_J from these elements.


Table 7.3 The “closeness” as measured by average (a) absolute error, and (b) KDist distance of the local PageRank vectors lJ and the global PageRank segments gJ , compared to the closeness of uniform vectors vJ and the global PageRank segments gJ for the Stanford/Berkeley dataset.

Approximation     Error Measure          Average Value
l_J               ||l_J − g_J||_1        0.2383
                  KDist(l_J, g_J)        0.0571
v_J               ||v_J − g_J||_1        1.2804
                  KDist(v_J, g_J)        0.8369

We normalize g_J so that its elements sum to 1 in order to compare it to the local PageRank vector l_J, which also has an L1 norm of 1. Specifically, g_J = x(j ∈ J)/||x(j ∈ J)||_1. We call these vectors g_J normalized global PageRank segments, or global PageRank segments for short.

The results are summarized in Table 7.3. The absolute error ||l_J − g_J||_1 is on average 0.2383 for the hosts in Stanford/Berkeley. We compare the error of the local PageRank vectors l_J to the error of a uniform vector v_J = [1/n_J]_{n_J×1} for each host J. The error ||v_J − g_J||_1 is on average 1.2804 for the hosts in Stanford/Berkeley. One can see that the local PageRank vectors are much closer to the global PageRank segments than the uniform vectors are. So a concatenation of the local PageRank vectors may form a better start vector for the standard PageRank iteration than the uniform vector.

The relative ordering of pages within a host induced by local PageRank scores is generally close to the intrahost ordering induced by the global PageRank scores. To compare the orderings, we measure the average "distance" between the local PageRank vectors l_J and global PageRank segments g_J using the Kendall's-τ rank correlation measure. The average distance kτ(l_J, g_J) is 0.0571 for the hosts in Stanford/Berkeley. Notice that this distance is small. This observation means that the ordering induced by the local PageRank is close to being correct, and thus suggests that the majority of the L1 error in the comparison of local and global PageRanks comes from the miscalibration of a few pages on each host. Indeed the miscalibration may be among important pages; as we discuss next, this miscalibration is corrected by the final step of our algorithm. Furthermore, the relative rankings of pages on different hosts are unknown at this point. For these reasons, we do not suggest using local PageRank for ranking pages; we use it only as a tool for computing the global PageRank more efficiently.

Table 7.4 confirms this observation for the host aa.stanford.edu. Notice that the ordering is preserved, and a large part of the discrepancy is due to http://aa.stanford.edu. The local PageRank computation gives too little weight to the root node. Since the elements of the local vector sum to 1, the ranks of all other pages are upweighted.
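The two comparison measures used above are easy to reproduce. The following small sketch (not the authors' code) computes the L1 error and a Kendall-τ-based distance between a local PageRank vector and the corresponding normalized global segment; the use of scipy.stats.kendalltau and the conversion KDist ≈ (1 − τ)/2 are assumptions made for the example.

import numpy as np
from scipy.stats import kendalltau

def l1_error(l_J, g_J):
    # ||l_J - g_J||_1 between the local vector and the normalized global segment.
    return np.abs(np.asarray(l_J) - np.asarray(g_J)).sum()

def kdist(l_J, g_J):
    # Rough KDist: fraction of discordant pairs, derived from Kendall's tau (ties ignored).
    tau, _ = kendalltau(l_J, g_J)
    return (1.0 - tau) / 2.0

def global_segment(x_global, pages_in_J):
    # g_J: the global PageRank restricted to host J, renormalized to sum to 1.
    seg = np.asarray([x_global[p] for p in pages_in_J], dtype=float)
    return seg / seg.sum()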


Table 7.4 The local PageRank vector l_J for the domain aa.stanford.edu (left) compared to the global PageRank segment g_J corresponding to the same pages. The local PageRank vector has a similar ordering to that of the normalized components of the global PageRank vector. The discrepancy in actual ranks is largely due to the fact that the local PageRank vector does not give enough weight to the root node http://aa.stanford.edu.

Web Page                                                  Local    Global
http://aa.stanford.edu                                    0.2196   0.4137
http://aa.stanford.edu/aeroastro/AAfolks.html             0.0910   0.0730
http://aa.stanford.edu/aeroastro/AssistantsAero.html      0.0105   0.0048
http://aa.stanford.edu/aeroastro/EngineerAero.html        0.0081   0.0044
http://aa.stanford.edu/aeroastro/Faculty.html             0.0459   0.0491
http://aa.stanford.edu/aeroastro/FellowsAero.html         0.0081   0.0044
http://aa.stanford.edu/aeroastro/GraduateGuide.html       0.1244   0.0875
http://aa.stanford.edu/aeroastro/Labs.html                0.0387   0.0454
http://aa.stanford.edu/aeroastro/Links.html               0.0926   0.0749
http://aa.stanford.edu/aeroastro/MSAero.html              0.0081   0.0044
http://aa.stanford.edu/aeroastro/News.html                0.0939   0.0744
http://aa.stanford.edu/aeroastro/PhdAero.html             0.0081   0.0044
http://aa.stanford.edu/aeroastro/aacourseinfo.html        0.0111   0.0039
http://aa.stanford.edu/aeroastro/aafaculty.html           0.0524   0.0275
http://aa.stanford.edu/aeroastro/aalabs.html              0.0524   0.0278
http://aa.stanford.edu/aeroastro/admitinfo.html           0.0110   0.0057
http://aa.stanford.edu/aeroastro/courseinfo.html          0.0812   0.0713
http://aa.stanford.edu/aeroastro/draftcourses.html        0.0012   0.0003
http://aa.stanford.edu/aeroastro/labs.html                0.0081   0.0044
http://aa.stanford.edu/aeroastro/prospective.html         0.0100   0.0063
http://aa.stanford.edu/aeroastro/resources.html           0.0112   0.0058
http://aa.stanford.edu/aeroastro/visitday.html            0.0123   0.0068

Local PageRank convergence rates

Another interesting question to investigate is how quickly the local PageRank scores converge. In Figure 7.4, we show a histogram of the number of iterations it takes for the local PageRank scores for each host in DNR-LargeWeb to converge to an L1 residual < 10^−1. Notice that most hosts converge to this residual in fewer than 12 iterations.

Interestingly, there is no correlation between the convergence rate of a host and the host's size. Rather, the convergence rate is primarily dependent on the extent of the nested block structure within the host. That is, hosts with strong nested blocks are likely to converge slowly, since they represent slow-mixing Markov chains. Hosts with a more random connection pattern converge faster, since they represent a fast-mixing Markov chain. This observation suggests that one could make the local PageRank computations even faster by wisely choosing the blocks. That is, if a host has a strong nested block structure, use the directories within that host as your blocks. However, this is not a crucial issue, because we show in Section 7.3 that the local PageRank computations can be performed in a distributed fashion in parallel with the crawl. Therefore, reducing the cost of the local PageRank computations is not a bottleneck for computing PageRank with our scheme, as the local computations can be pipelined with the crawl.2

2 Generally this requires a site-based crawler (such as the WebBase crawler [18]) which maintains a pool of active hosts and crawls hosts to completion before adding new hosts to the pool.


Figure 7.4 Local PageRank convergence rates for hosts in DNR-LargeWeb. The x-axis is the number of iterations until convergence to a tolerance of 10−1 , and the y-axis is the fraction of hosts that converge in a given number of iterations.

7.2.3 Estimating the Relative Importance of Each Block

In this section, we describe computing the relative importance, or BlockRank, of each block. Assume there are k blocks in the Web. To compute BlockRanks, we first create the block graph B, where each vertex in the graph corresponds to a block in the Web graph. An edge between two pages in the Web is represented as an edge between the corresponding blocks (or a self-edge, if both pages are in the same block). The edge weights are determined as follows: the weight of an edge between blocks I and J is defined to be the sum of the edge-weights from pages in I to pages in J in the Web graph, weighted by the local PageRanks of the linking pages in block I. That is, if A is the Web graph and l_i is the local PageRank of page i in block I, then the weight of edge B_IJ is given by

B_IJ = Σ_{i∈I, j∈J} A_ij · l_i.

We can write this in matrix notation as follows. Define the local PageRank matrix L to be the n × k matrix whose columns are the local PageRank vectors l_J:

L = [ l_1   0   ···   0
      0    l_2  ···   0
      ⋮     ⋮    ⋱    ⋮
      0     0   ···  l_K ].


Define the matrix S to be the n × k matrix that has the same structure as L, but whose nonzero entries are all replaced by 1. Then the block matrix B is the k × k matrix

B = L^T A S.    (7.1)

Notice that B is a transition matrix where the element B_IJ represents the transition probability of block I to block J. That is, B_IJ = p(J | I). Once we have the k × k transition matrix B, we may use the standard PageRank algorithm on this reduced matrix to compute the BlockRank vector b. That is,

b = PageRank(B, v_k, v_k),

where v_k is the uniform k-vector [1/k]_{k×1}. Note that this process is the same as computing the stationary distribution of the transition matrix c·B + (1−c)E_k, where we define E_k = [1]_{k×1} × v_k^T. The analogy to the random surfer model of [56] is this: we imagine a random surfer going from block to block according to the transition probability matrix B. At each stage, the surfer will get bored with probability 1 − c and jump to a different block.

7.2.4 Approximating Global PageRank Using Local PageRank and BlockRank

In this section, we find an estimate to the global PageRank vector x. At this point, we have two sets of rankings. Within each block J, we have the local PageRanks l_J of the pages in the block. We also have the BlockRank vector b whose elements b_J are the BlockRank for each block J, measuring the relative importance of the blocks. We may now approximate the global PageRank of a page j ∈ J as its local PageRank l_j, weighted by the BlockRank b_J of the block in which it resides. That is,

x_j^(0) = l_j · b_J.

In matrix notation, this is

x^(0) = L b.

Recall that the local PageRanks of each block sum to 1. The BlockRanks also sum to 1; therefore, our approximate global PageRanks will also sum to 1. The reasoning follows. The sum of our approximate global PageRanks sum(x_j) = Σ_j x_j can be written as a sum over blocks:

sum(x_j) = Σ_J Σ_{j∈J} x_j.

Using the definition of x_j above,

sum(x_j) = Σ_J Σ_{j∈J} l_j b_J = Σ_J b_J Σ_{j∈J} l_j.


Since the local PageRanks for each domain sum to 1 (Σ_{j∈J} l_j = 1),

sum(x_j) = Σ_J b_J.

And since the BlockRanks also sum to 1 (Σ_J b_J = 1),

sum(x_j) = 1.

Therefore, we may use our approximate global PageRank vector x^(0) as a start vector for the standard PageRank algorithm.

7.2.5 Using This Estimate as a Start Vector

In order to compute the true global PageRank vector x from our approximate PageRank vector x^(0), we simply use it as a start vector for standard PageRank. That is,

x = pageRank(G, x^(0), v),

where G is the graph of the Web, and v is the uniform distribution over root nodes. In Section 7.6, we show how to compute different personalizations quickly once x has been computed. The BlockRank algorithm for computing PageRank, presented in the preceding sections, is summarized by Algorithm 13.

0. Sort the Web graph lexicographically as described in Section 7.1, exposing the nested block structure of the Web.
1. Compute the local PageRank vector l_J for each block J:
   foreach block J do
       l_J = pageRank(G_JJ, s_J, v_J);
   end
2. Compute block transition matrix B and BlockRanks b:
   B = L^T A S
   b = pageRank(B, v_k, v_k)
3. Find an approximation x^(0) to the global PageRank vector x by weighting the local PageRanks of pages in block J by the BlockRank of J:
   x^(0) = L b
4. Use this approximation as a start vector for a standard PageRank iteration:
   x = pageRank(G, x^(0), v)

Algorithm 13: BlockRank Algorithm
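Pulling the four steps together, here is a compact end-to-end sketch of the BlockRank pipeline. It is a toy illustration under stated assumptions: the pagerank helper is a plain power iteration rather than the book's optimized Algorithm 1, dangling nodes are assumed handled, blocks are given by a precomputed page-to-block assignment, the final personalization is uniform over all pages (the book uses a distribution over root nodes), and P is a SciPy CSR row-stochastic link matrix.

import numpy as np
import scipy.sparse as sp

def pagerank(P, start, v, c=0.85, tol=1e-6, max_iter=1000):
    # Plain power iteration: x <- c P^T x + (1 - c) v, on a row-stochastic P.
    x = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        x_new = c * (P.T @ x) + (1 - c) * np.asarray(v)
        if np.abs(x_new - x).sum() < tol:
            break
        x = x_new
    return x_new / x_new.sum()

def blockrank(P, block_of_page, root_of_block, k):
    n = P.shape[0]
    block_of_page = np.asarray(block_of_page)
    local = np.zeros(n)
    # Step 1: local PageRank l_J for every block, personalized on the block's root page.
    for J in range(k):
        idx = np.where(block_of_page == J)[0]
        P_JJ = P[idx, :][:, idx]
        s = np.full(len(idx), 1.0 / len(idx))
        v = np.zeros(len(idx))
        v[list(idx).index(root_of_block[J])] = 1.0
        local[idx] = pagerank(P_JJ, s, v)
    # Step 2: block matrix B = L^T A S and BlockRank vector b.
    rows, cols = np.arange(n), block_of_page
    L = sp.csr_matrix((local, (rows, cols)), shape=(n, k))
    S = sp.csr_matrix((np.ones(n), (rows, cols)), shape=(n, k))
    B = np.asarray((L.T @ P @ S).todense())
    b = pagerank(B, np.full(k, 1.0 / k), np.full(k, 1.0 / k))
    # Steps 3-4: x^(0) = L b, then standard PageRank started from x^(0).
    x0 = L @ b
    return pagerank(P, x0, np.full(n, 1.0 / n))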


7.3 ADVANTAGES OF BLOCKRANK

The BlockRank algorithm has four major advantages over the standard PageRank algorithm.

Advantage 1
A major speedup of our algorithm comes from caching effects. All the host-blocks in our crawl are small enough that each block graph fits in main memory, and the vector of ranks for the active block largely fits in the CPU cache. As the full graph does not fit in main memory, the local PageRank iterations thus require less disk i/o than the global computations. The full rank vectors do fit in main memory; however, using the sorted link structure3 dramatically improves the memory access patterns to the rank vector. Indeed, if we use the sorted link structure, designed for BlockRank, as the input instead to the standard PageRank algorithm, the enhanced locality of reference to the rank vectors cuts the time needed for each iteration of the standard algorithm by more than half: from 6.5 minutes to 3.1 minutes for each iteration on DNR-LargeWeb!

Advantage 2
In our BlockRank algorithm, the local PageRank vectors for many blocks will converge quickly; thus the computations of those blocks may be terminated after only a few iterations. This increases the effectiveness of the local PageRank computation by allowing it to expend more computation on slowly converging blocks, and less computation on faster converging blocks. Note, for instance, in Figure 7.4 that there is a wide range of rates of convergence for the blocks. In the standard PageRank algorithm, iterations operate on the whole graph; thus the convergence bottleneck is largely due to the slowest blocks. Much computation is wasted recomputing the PageRank of blocks whose local computation has already converged.

Advantage 3
The local PageRank computations in Step 1 of the BlockRank algorithm can be computed in a completely parallel or distributed fashion. That is, the local PageRanks for each block can be computed on a separate processor, or computer. The only communication required is that, at the end of Step 1, each computer should send its local PageRank vector l_J to a central computer that will compute the global PageRank vector. If our graph consists of n total pages, the net communication cost consists of 8n bytes (if using 8-byte double precision floating point values). Naive parallelization of the computation that does not exploit block structure would require a transfer of 8n bytes after each iteration, a significant penalty. Furthermore, the local PageRank computations can be pipelined with the web crawl. That is, the local PageRank computation for a host can begin as a separate process as soon as the crawler finishes crawling the host. In this case, only the costs of Steps 2–4 of the BlockRank algorithm become rate-limiting.

3 As in Section 7.1, this entails assigning document ids in lexicographic order of the url (with the components of the full hostname reversed).


Advantage 4
In several scenarios, the local PageRank computations (the results of Step 1) can be reused during future applications of the BlockRank algorithm. Consider, for instance, news sites such as cnn.com that are crawled more frequently than the general Web. In this case, after a crawl of cnn.com, if we wish to recompute the global PageRank vector, we can rerun the BlockRank algorithm, except that in Step 1, only the local PageRanks for the cnn.com block need to be recomputed. The remaining local PageRanks will be unchanged and can be reused in Steps 2–3. In this way, we can also reuse the local PageRank computations for the case of computing several "personalized" PageRank vectors. We further discuss personalized PageRank in Section 7.6.

7.4 EXPERIMENTAL RESULTS

In this section, we investigate the speedup of BlockRank compared to the standard algorithm for computing PageRank. The speedup of our algorithm for typical scenarios comes from the first three advantages listed in Section 7.3. The speedups are due to less expensive iterations, as well as fewer total iterations. (Advantage 4 is discussed in subsequent sections.)

We begin with the scenario in which PageRank is computed after the completion of the crawl; we assume that only Step 0 of the BlockRank algorithm is computed concurrently with the crawl. As mentioned in Advantage 1 from the previous section, the improved reference locality due to blockiness, exposed by lexicographically sorting the link matrix, achieves a speedup of a factor of two in the time needed for each iteration of the standard PageRank algorithm. This speedup is completely independent of the value chosen for c, and does not affect the rate of convergence as measured in number of iterations required to reach a particular L1 residual.

If instead of the standard PageRank algorithm, we use the BlockRank algorithm on the block structured matrix, we gain the full benefit of Advantages 1 and 2; the blocks each fit in main memory, and many blocks converge more quickly than the convergence of the entire Web. We compare the wallclock time it takes to compute PageRank using the BlockRank algorithm in this scenario, where local PageRank vectors are computed serially after the crawl is complete, with the wallclock time it takes to compute PageRank using the standard algorithm given in [56]. Table 7.5 gives the running times of the 4 steps of the BlockRank algorithm on the LargeWeb dataset. The first 3 rows of Table 7.6 give the wallclock running times for standard PageRank, standard PageRank using the url-sorted link matrix, and the full BlockRank algorithm computed after the crawl. We see a small additional speedup for BlockRank on top of the previously described speedup. Subsequently, we will describe a scenario in which the costs of Steps 1–3 become largely irrelevant, leading to further effective speedups.

In this next scenario, we assume that the cost of Step 1 can be made negligible in one of two ways: the local PageRank vectors can be pipelined with the web crawl, or they can be computed in parallel after the crawl. If the local PageRank vectors


Table 7.5 Running times for the individual steps of BlockRank for c = 0.85 in achieving a final residual of

... r_ic && r_ik ≥ 0};
while A ≠ ∅ do
    find peer x with max r_ix ∈ A;
    if connect_x(i) = true then
        disconnect_c(i);
        return from function;
    end
    else
        A = A \ x;
    end
end
if r_ic < 0 then
    disconnect_c(i);
    connectToRandomPeers();
    return from function;
end
// Rest same as Algorithm 21;
end
}

function connect(peer j) {
    if |N(i)| < τ then
        if s_ij ≥ 0 && r_ij ≥ 0 then
            return true // accept connection to peer j;
        end
    end
    else
        find neighbor k with min s_ik;
        if s_ik < s_ij && r_ij ≥ 0 then
            disconnect_k(i);
            return true // accept connection to peer j;
        end
    end
    return false // connection denied;
}

function disconnect(peer j) {
    N(i) = N(i) \ j;
    if numLostConnections(i) > threshold then
        drop all current connections;
        connectToRandomPeers();
        reset numLostConnections(i);
    end
    else
        drop neighbor k with lowest r_ik;
        connectToRandomPeers();
        increment numLostConnections(i);
    end
}

Algorithm 22: APT Protocol Connection Trust Extension
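For intuition, the connection-acceptance rule in the connect() routine above can be expressed in a few lines of Python. This is a hedged sketch only: the Peer object, its field names (local_trust, connection_trust, neighbors, tau, drop), and the dictionary-based scores are assumptions made for illustration, not the protocol's reference implementation.

def accept_connection(peer_i, j):
    # Decide whether peer i accepts a connection request from peer j, mirroring connect(peer j):
    # accept if there is a free slot, otherwise replace the lowest-local-trust neighbor when j is better.
    s = peer_i.local_trust        # s[j]: download-based score for peer j
    r = peer_i.connection_trust   # r[j]: query-outcome score for peer j
    if len(peer_i.neighbors) < peer_i.tau:
        return s.get(j, 0) >= 0 and r.get(j, 0) >= 0
    worst = min(peer_i.neighbors, key=lambda k: s.get(k, 0))
    if s.get(worst, 0) < s.get(j, 0) and r.get(j, 0) >= 0:
        peer_i.drop(worst)        # disconnect_k(i) in the pseudocode
        return True
    return False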


Table 10.1 Connection Variables.

local trust         peer score based on download transactions
connection trust    neighbor score based on the outcome of a query forwarded by the neighbor
void downloads      number of queries with no response

a random connection. This will address the new nodes problem at the expense of some efficiency and robustness. A detailed analysis of hybrid approaches is outside the scope of this book.

10.4 EMPIRICAL RESULTS

In this section, we assess the performance of the proposed scheme and compare it to a P2P network with a standard power-law topology. The performance analysis is given for standard conditions as well as a variety of threat models. The success of the APT protocol is dependent on the variables a peer uses to determine its connections to other peers. Equipped with the variables listed in Table 10.1, a peer is able to avert malicious attacks and choose, from its acquaintances, peers that share similar interests.

Metrics. We are interested in measuring the efficiency, security, and incentives of Adaptive P2P Topologies. Two metrics that measure the efficiency of the network are the number of messages passed and the number of authentic responses per query. We define the total number of messages to be the number of queries, responses, and file downloads in the network. We define the number of authentic responses as those query responses that would lead to the download of an authentic file. The number of inauthentic responses per query and the average characteristic path length to malicious peers are used to evaluate the security of our protocol. The number of inauthentic responses per query represents the total number of query responses originating from a malicious peer. The characteristic path length to a peer is the average hop-count to the peer from all other peers in the network. We use the average characteristic path length to active peers and freeriders to measure the incentives that our protocol gives for participation.

10.4.1 Malicious Peers Move to Fringe

10.4.1.1 Principle
One common attack on P2P networks today is the inauthentic file attack, where malicious peers upload corrupt, inauthentic, or misnamed content onto the network. Since the APT protocol disconnects from peers that upload unsatisfactory files, malicious peers eventually move to the fringe of the network. Moving malicious peers to the fringe of the network has been shown to be a very effective strategy in combating certain types of attacks [23].


Figure 10.2 Average characteristic path (cycle 0).

Experiments
We use the characteristic path length to illustrate how the malicious peers move to the fringe of the network using the APT protocol. We define the average characteristic path length to a peer i as the average of the shortest path lengths between all other peers in the network and peer i:

cpl_i = (1 / |P \ i|) Σ_{j ∈ P\i} shortestPath(i, j),

where P \ i is the set of all peers in the network except peer i.
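Computed over an unweighted overlay graph, this quantity is a breadth-first-search average. The sketch below is an illustrative implementation; the adjacency-list input format, the undirected treatment of the overlay, and the convention of returning infinity for unreachable peers are assumptions made for the example.

from collections import deque
import math

def characteristic_path_length(adj, i):
    # Average shortest-path (hop) distance between peer i and every other peer.
    # adj maps each peer to the set of its neighbors.
    dist = {i: 0}
    queue = deque([i])
    while queue:                          # BFS from i
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    others = [p for p in adj if p != i]
    if any(p not in dist for p in others):
        return math.inf                   # some peer cannot reach i
    return sum(dist[p] for p in others) / len(others)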

In Figure 10.2, the x-axis represents the peer id and the y-axis is the characteristic path length to that peer. The data were extracted from the Query-Cycle Simulator at Cycle 0, which describes a power-law topology. The x-axis is segmented into two regions, each reflecting a type of peer in the network. The gray region represents peers in the network that are considered good, and the black region represents malicious peers. The average characteristic path length to any peer is about 2.5 hops. During network bootstrap, malicious peers were encoded to aggressively connect to good peers, and thus the average characteristic path length for malicious peers is 2.3 hops.

Figure 10.3 shows the characteristic path length after 95 simulated cycles. The overall average characteristic path length is 4.94 hops. Notice that the average characteristic path to a malicious peer (9.84 hops) is much larger than that to a good peer


Figure 10.3 Average characteristic path (cycle 95).


Figure 10.4 Characteristic path length (cycle 0–100).

(4.02 hops). The separation in characteristic path length shows that as good peers drop connections based on inauthentic downloads, malicious peers are pushed into the fringe of the network. Figure 10.4 plots the evolution of the average characteristic path length for good peers and malicious peers. Beyond the initial 100 cycles, no path exists from a good peer to a malicious peer. The divergence in the characteristic path lengths of good


Figure 10.5 Characteristic path length based on file uploads.

and malicious peers indicates that these two peer types become distinct over time. A good peer can take advantage of this result by limiting the scope of its search query so that a malicious peer never receives its query. Section 10.4.4 describes a strategy to avoid malicious query responses by reducing the TTL for a peer's search query.

10.4.2 Freeriders Move to Fringe

Principle
When peer i finds a peer that it is likely to download from, it connects to that peer and disconnects from its neighbor with the lowest local trust score. Since a freerider has a local trust score of 0, freeriders move to the fringe of the network as well.

Experiments
As shown in Figure 10.3, certain good peers have higher than average characteristic path lengths. These peers are freeriders, and their path length reflects the fact that it is not advantageous to connect to peers that do not share any files. Notice also that the path length to any freerider is still shorter than the path length to any malicious peer. Since freeriders are not actively uploading inauthentic files, other peers remain connected to them until they find a more desirable peer.

Figure 10.5 is an equidistant histogram of the characteristic path length for all good peers after a 100 cycle simulation. The x-axis lists four buckets representing the number of uploads a peer has provided. Peers with no uploads (freeriders) fall

119

ADAPTIVE P2P TOPOLOGIES

12

Number of Connections

10

8

6

4

2

0 0

200

400 600 800 Shared Data Volume

1000

1200

Figure 10.6 Connections versus shared data volume.

into the bucket labeled 0, while a peer that has uploaded N authentic files falls into N  × 50. For example, peer i with N = 60 uploaded files is counted bucket B =  50 in bucket B = 100. The figure shows that freeriders take an average of 3.4 hops to reach while peers with uploads take around 2 hops. Therefore, the peers that do not share files are given a narrow view of the network. 10.4.3 Active Peers Are Rewarded Principle Active peers have more opportunities to connect to other active peers, since their local trust scores will be high. For example, an active peer i with τ = 3 may have connections with local trust values of 10, 6, and 4. An inactive peer will not have the opportunity to connect to peer i, while an active peer that has provided more than 4 authentic files to that peer will. Thus, the reward for being an active peer is the opportunity to connect directly to other active peers. Experiments The number of connections a peer has relative to the number of files the peer shares is plotted in Figure 10.6. To make the results more apparent, the network was set up to limit the maximum number of connections τ to 11 (in this experiment only). Notice the trend that as the number of shared files increases so does the number of connections. This shows that peers that share many files are rewarded with a wider view of the network.


Figure 10.7 Connections versus authentic uploads.

A peer is rewarded for sharing high quality files. In Figure 10.6 there are peers that share large amounts of data but receive only a few connections in return. These peers were set to share many unpopular files. Since connections are derived from downloads, peers are not rewarded for sharing vast amounts of unpopular data. One indication of a peer that shares popular files is the number of executed uploads. Figure 10.7 plots the number of connections a peer has relative to the number of authentic uploads it has performed. The graph clearly shows that peers that actively upload files are rewarded with a wide view of the network.

Figure 10.8 shows the fraction of authentic responses received by a peer compared to the total number of authentic files it uploads. For example, on average 43% of all query responses returned to a peer that executed 50 uploads originated from a good peer. Notice that peers with more than 65 authentic file uploads receive only authentic responses. This shows that an active peer is further rewarded with connections to peers with high fidelity. Section 10.4.4 describes clusters or neighborhoods that contain peers with similar interests and quality of service. These clusters become unreachable to a malicious peer residing on the fringe of the network.

10.4.4 Efficient Topology

Principle
In the APT protocol, connections are made based on download history, so peers connect to peers that share their interests. Eventually, clusters of like-minded peers form and are connected to other clusters by hub peers which are active and have many interests. These characteristics describe a small-world network [70]. Such a network is sparsely connected, but has a short average characteristic path length


Figure 10.8 Authentic response ratio.

and a high cluster coefficient. A small-world network thus allows a wide view of the network with low communication overhead.

Experiments
The data shown in Figures 10.9, 10.10, and 10.11 are extracted from the same simulated session. In this experiment, peers follow the APT protocol and messages are set with a 3-hop TTL. Furthermore, malicious peers were encoded to aggressively flood the network with their query responses by responding to all queries they receive. Throughout the experiment, the average overall characteristic path length to a peer was about 2.9 hops and the network diameter was between 7 and 9 hops.

The total number of query and response messages transferred during a given cycle is plotted in Figure 10.9. The decline in network traffic can be attributed to good peers dropping connections to malicious peers, since malicious peers are unable to respond to queries as they lose their grasp on the network. Figure 10.10 shows that the number of authentic responses returned to a peer increases over time. The result supports the fact that peers connect to other good peers that share similar content, and consequently more queries are being answered. At 70 cycles, the average number of messages transferred per cycle is about 100,000, which is 1/3 of the number of messages transferred during the first cycle. Furthermore, Figure 10.11 shows that 97% of all responses after 70 cycles are authentic.

A small TTL setting takes advantage of the fact that malicious peers move to the fringe of the network. The number of malicious responses to a query during a given cycle under three separate TTL settings, max_ttl = 4, med_ttl = 3, and min_ttl = 2, is given in Figure 10.12. Since a malicious peer is on average 3 hops



Figure 10.9 Network traffic.


Figure 10.10 Authentic responses.

further away than a good peer, there is a significant decline in malicious query responses when using med_ttl or min_ttl. Therefore, setting the TTL to be less than the overall average characteristic path can be effective in reducing the number of malicious query responses.

123

ADAPTIVE P2P TOPOLOGIES

Authentic Response Ratio

1

0.8

0.6

0.4

0.2

0

0

10

20

30

40

50 60 Cycle

70

80

90

100

Figure 10.11 Authentic response ratio.


Figure 10.12 Malicious query responses based on TTL.

In the previous figures we have shown that even as the number of messages passed decreases, peers still receive good quality of service. One reason is that malicious peers are moved to the fringe of the network, thereby decreasing the unnecessary message overhead caused by malicious responses. Another reason is that peers are organized into clusters of peers that share similar interests, so the files that are of interest to a peer are likely to be located nearby.

We use the average cluster coefficient to quantify the clustering effect of the APT protocol. The local cluster coefficient C_i for peer i ∈ P with k_i neighbors is defined as

C_i = \frac{2E_i}{k_i(k_i - 1)},

where E_i is the actual number of edges that exist between the k_i neighbors. Using this definition, if peer i and all peers in N(i) form a clique, then C_i = 1. The average cluster coefficient is then defined as C_i averaged over all peers in P.

Figure 10.13 Cluster coefficient (average cluster coefficient per cycle).

Figure 10.13 measures the average cluster coefficient for all peers in the network. The increase in the cluster coefficient from the initial power-law topology at cycle 0 shows that clusters form under the APT protocol. These clusters consist of peers that have had many positive interactions and thus share similar content interests.

Figures 10.14 and 10.15 demonstrate this clustering effect by measuring the link ratio. The link ratio is defined as the percentage of peer pairs assigned a particular value that are also neighbors. Figure 10.14 measures the link ratio with respect to local trust values. For example, edge (i, j) exists in 29% of all cases where peer i has a local trust value of 25 for peer j. Notice the low ratio for local trust scores below 5 and the high ratio for scores above 25. Since local trust scores are built up from successful transactions, the plot shows that a peer's connections are determined by its transactions with other peers.
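As a concrete reading of this definition, the sketch below computes the local cluster coefficients and their average for a small undirected graph; the adjacency-set representation is assumed only for illustration.

```python
def cluster_coefficients(adj):
    """adj maps each peer to the set of its neighbors (undirected graph).
    Returns (per-peer local cluster coefficients, their average)."""
    local = {}
    for i, neighbors in adj.items():
        k = len(neighbors)
        if k < 2:
            local[i] = 0.0          # coefficient taken as zero for k < 2
            continue
        # E_i: edges that actually exist among the k neighbors of i
        edges = sum(1 for u in neighbors for v in neighbors
                    if u < v and v in adj[u])
        local[i] = 2.0 * edges / (k * (k - 1))
    avg = sum(local.values()) / len(local)
    return local, avg

# A 4-clique (peers 0-3) attached to a chain (4-5): peers 0-2, whose neighbors
# form a clique, have C_i = 1, while the chain peers pull the average down.
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4},
       4: {3, 5}, 5: {4}}
print(cluster_coefficients(adj))
```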

Figure 10.14 Link ratio based on local trust (link ratio as a function of local trust value).

Figure 10.15 Link ratio based on content similarity (link ratio as a function of content similarity).

In Figure 10.15, the link ratio with respect to the similarity of the peers' content is plotted. The content similarity of peer i and peer j is defined as

S(i, j) = 1 - \frac{1}{2} \sum_{t=1}^{n} |c_{it} - c_{jt}|,

where n is the total number of content categories and c_it is the number of files peer i shares in content category t, normalized by the total number of files shared by peer i. Using this definition, the value S(i, j) = 1 means that peer i and peer j have the same content distribution vectors c; that is, c_it = c_jt for each content category t. The graph shows that clusters form among peers whose content is largely similar.

Clustering based on content similarity increases the probability that a query is answered within a few hops. Moreover, responses to queries from nearby peers can be trusted more than responses from peers farther away. By lowering its query horizon, a peer can take advantage of the clustering that occurs under the APT protocol and thereby increase the overall network efficiency.
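Read this way, the similarity is one minus half the L1 distance between the two peers' normalized content distributions. The following is a minimal sketch under that reading; the per-category counts are made-up inputs.

```python
def content_distribution(category_counts):
    """Normalize per-category file counts into a distribution c_i."""
    total = sum(category_counts)
    return [n / total for n in category_counts] if total else category_counts

def content_similarity(counts_i, counts_j):
    """S(i, j) = 1 - (sum_t |c_it - c_jt|) / 2, which lies in [0, 1]."""
    ci, cj = content_distribution(counts_i), content_distribution(counts_j)
    return 1.0 - sum(abs(a - b) for a, b in zip(ci, cj)) / 2.0

# Two peers sharing files over three content categories (counts are illustrative).
print(content_similarity([8, 2, 0], [4, 1, 0]))    # identical distributions -> 1.0
print(content_similarity([10, 0, 0], [0, 0, 10]))  # disjoint interests      -> 0.0
```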

10.5 THREAT SCENARIOS

In previous sections it was assumed that malicious peers simply flood the network with inauthentic files in an attempt to subvert the system. We now evaluate the performance of our protocol in preventing malicious connections (connections leading to a malicious peer) and inauthentic file downloads under a variety of threat scenarios.

10.5.1 Threat Model A

In this model, malicious peers respond to all queries except those issued by a neighbor. If a malicious peer is chosen as a download source, it will upload an inauthentic file.

Node model. Let G ⊂ P be a set of good peers and m be a malicious peer directly connected to all peers in G. The peers in G also hold connections to other peers in P, as shown in Figure 10.16.

Figure 10.16 Node model for threat models A and B (a malicious peer m attached to a set of good peers G within the network P).

Query model. All queries received by peer m are handled by the following two cases (a short sketch follows the list).

1. If the query originated from some peer in G, then peer m does not respond.
2. If the query originated from some peer in P \ {G ∪ m}, then peer m responds according to the model presented in Chapter 8.
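The two cases above can be summarized as a response policy keyed on the query's originator. The sketch below is a non-authoritative rendering of that policy; the `respond` callback stands in for the Chapter 8 response model and is a placeholder.

```python
def threat_a_handle_query(m, query_origin, G, respond):
    """Threat model A: malicious peer m never answers its direct neighbors in G,
    answers everyone else (per the Chapter 8 response model), and never
    forwards the query onward."""
    if query_origin in G or query_origin == m:
        return []                    # case 1: stay invisible to direct neighbors
    # case 2: respond, typically offering an inauthentic file if later chosen
    return respond(m, query_origin)

# Illustrative use with a stub response model.
G = {1, 2, 3}
stub_respond = lambda m, origin: [f"response from {m} to {origin}"]
print(threat_a_handle_query(9, 2, G, stub_respond))   # [] -- neighbor, no response
print(threat_a_handle_query(9, 7, G, stub_respond))   # one (inauthentic) response
```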

Instead of propagating the query to its neighbors, peer m drops the query after potentially responding. Peer m does not generate search queries of its own. Given this query model, peer m only uploads inauthentic files to peers in P \ {G ∪ m}. Consequently, peer m avoids inauthentic file detection by a neighboring peer in G; that is, peer i ∈ G will have a local trust score of zero for peer m. The goal of the malicious peer is to prevent connection drops due to negative local trust scores.

Threat model A is naturally combated by the APT protocol. At some point peer i ∈ G begins to notice numerous peers disconnecting from it, and so assumes it is relaying queries to a malicious peer. The connection loss triggers peer i to replace a low-trust connection with some random connection, as outlined in Algorithm 22. The question now is whether peer i makes the right decision by dropping the connection to peer m. To answer this, observe that a malicious peer behaves as a freerider, which makes it a less than desirable connection. According to Algorithm 22, after a connection loss, a peer drops its connections to peers with low local trust scores and reconnects to random peers. Since malicious peer m will have a local trust score of 0, it is more likely to be dropped.

Figure 10.17 shows the progression of malicious connections for a simulated session set up for threat model A. The early drop in the number of malicious connections is the result of swapping out low-value connections to malicious peers. The short life of a malicious connection evenly distributes the inauthentic content placed in the network by a malicious peer. This even distribution works against the malicious peer, since more peers are made aware of its intent. As the simulation moves forward, malicious peers encounter resistance in making new connections.

Figure 10.17 Threat model A malicious connections (per cycle).
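The connection-replacement step referred to above (Algorithm 22, presented earlier in the chapter) can be read roughly as follows; the function name, trust bookkeeping, and threshold are hypothetical and only mirror the behavior described in this section.

```python
import random

def replace_low_trust_connection(peer, neighbors, local_trust, candidates, threshold=0):
    """After a connection loss, drop the neighbor with the lowest local trust
    (a freerider or malicious peer typically sits at 0) and reconnect to a
    random peer that is not already a neighbor."""
    if not neighbors:
        return None
    worst = min(neighbors, key=lambda j: local_trust.get((peer, j), 0))
    if local_trust.get((peer, worst), 0) <= threshold:
        neighbors.remove(worst)
        pool = [c for c in candidates if c != peer and c not in neighbors]
        if pool:
            neighbors.add(random.choice(pool))
        return worst
    return None

# Peer 1 has three neighbors; peer 9 behaves maliciously and has zero local trust.
neighbors = {2, 5, 9}
trust = {(1, 2): 4, (1, 5): 7, (1, 9): 0}
dropped = replace_low_trust_connection(1, neighbors, trust, candidates=range(20))
print(dropped, neighbors)
```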


After cycle 190, all connection requests made by malicious peers are denied by good peers. Thus threat model A is handled well by the APT protocol.

10.5.2 Threat Model B

In threat model B, the malicious peer entices its neighbors with a few authentic files in order to gain some local trust and evade connection loss. The malicious peer will likely want to minimize the number of authentic file uploads because of their counterproductive cost: the purpose of a malicious peer is to disrupt the sharing of authentic files, not to support it.

Node model. Let G ⊂ P be a set of good peers and m be a malicious peer directly connected to all peers in G. Peers in G are in turn connected to other peers in P, as shown in Figure 10.16.

Query model. Peer m responds to queries as specified by the model in Chapter 8, subject to the constraint that no more than a 0.1 fraction of all its uploads go to peers in G. Instead of propagating the query, peer m drops it. Peer m does not generate search queries of its own.

Under threat model B, peer m serves authentic files to peers in G and uploads inauthentic files to peers in P \ {G ∪ m}. By serving peers in G with authentic files, peer m increases its chances of maintaining its connection to P \ {G ∪ m} via G.

Threat model B is thwarted by the connection trust extension described in Section 10.3.3. According to Algorithm 22, the peers in G will eventually lose their connections with peers in P \ {G ∪ m} due to poor connection trust scores caused by relaying messages to peer m. The loss of the connections to P \ {G ∪ m} lessens the value of the peers in G to peer m. Peer m will likely try to form connections with peers in P \ {G ∪ m}. However, reconnecting becomes increasingly difficult, since peer m will have built up negative local trust scores with peers in P \ {G ∪ m}. Assuming that the majority of uploads by peer m are inauthentic (as in any productive malicious attack), peer m will eventually lose the ability to connect to good peers in P.

Figure 10.18 plots the succession of malicious connections in a simulated session under threat model B. The graph starts out similar to that shown in Figure 10.17. However, in later cycles the number of malicious connections remains around 5, while in Figure 10.17 all malicious peers are completely disconnected. Nevertheless, at this point most queries never reach the malicious peers, since the average path length to them is 4 hops more than to a good peer. Consequently, as shown in Figure 10.19, the number of inauthentic downloads is negligible.

A considerable amount of noise is present in Figures 10.18 and 10.19, which can be attributed to the volatile nature of connections to malicious peers and the authentic files uploaded by malicious peers. The attenuation of the noise in both figures is caused by the increased resistance toward malicious connections. Peers that remain connected to malicious peers do so because they are served authentic files. These peers are tagged as bad connections because of poor connection trust scores. Dropping peers with low connection trust scores closes the conduit to the malicious peers.
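The connection trust mechanism itself is defined in Section 10.3.3; the fragment below is only one plausible reading of how inauthentic downloads relayed through a connection might be charged against that connection, with all names and the drop threshold being assumptions.

```python
def update_connection_trust(conn_trust, relay_neighbor, authentic):
    """Credit or debit the neighbor through which a query response was relayed,
    depending on whether the resulting download was authentic."""
    conn_trust[relay_neighbor] = conn_trust.get(relay_neighbor, 0) + (1 if authentic else -1)

def connections_to_drop(conn_trust, threshold=-2):
    """Connections whose trust falls below the threshold are candidates for
    being severed, closing the conduit to a hidden malicious peer."""
    return [n for n, score in conn_trust.items() if score < threshold]

# Neighbor 4 keeps relaying responses that lead to inauthentic downloads.
trust = {}
for authentic in (True, False, False, False, False, True, False):
    update_connection_trust(trust, relay_neighbor=4, authentic=authentic)
print(trust, connections_to_drop(trust))   # {4: -3} [4]
```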

Figure 10.18 Threat model B malicious connections (per cycle).

Figure 10.19 Threat model B inauthentic downloads (per cycle).

The malicious peers may then seek other, more fruitful connections. However, poor local trust scores make it difficult for malicious peers to form new connections, and thus they are trapped in their current ones. If malicious peers are not able to continually satisfy queries, then by Algorithm 22 the connections will be severed due to void downloads.

Figure 10.20 Node model for threat model C (the malicious peers are partitioned into sets Ma and Mi within the network P).

Although not as effective as in threat model A, the APT protocol is able to prevent most inauthentic downloads and keep malicious peers at bay.

10.5.3 Threat Model C

In threat model C, a set of malicious peers that upload inauthentic files is connected to another set of malicious peers that provide authentic files. The malicious peers serving authentic files maintain the connection to the rest of the network while the others flood the network with inauthentic files.

Node model. Figure 10.20 illustrates the node model for threat model C. Let M ⊂ P be the set of all malicious peers in P. We partition M into two disjoint sets Ma and Mi. Peers in Mi are connected to all peers in Ma, and malicious peers in Ma also maintain connections to peers in P \ {Ma ∪ Mi}.

Query model. Both sets of malicious peers respond to queries according to the model presented in Chapter 8. The search queries received by peers in Ma are forwarded only to peers in Mi, while peers in Mi do not propagate the queries they receive. Neither of the malicious sets generates search queries of its own. After being chosen as a download source, peers in Ma upload authentic files, while those in Mi upload inauthentic files.

The motivation behind threat model C is twofold:

1. Malicious peers no longer depend on servicing a large number of requests made by a single good peer.
2. Connection trust is inherently weaker than local trust at preventing malicious connections. This threat model exploits that weakness by using malicious peers as conduits that upload only authentic files to the rest of the network.

The main advantage of this threat model over the previous ones is that peers in Ma are never assigned a negative local trust score. Therefore, good peers rely entirely on negative connection trust scores to stave off malicious connections from peers in Ma.
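The division of labor between Ma and Mi described in the query model can be sketched as follows; the helper names and peer identifiers are hypothetical.

```python
def threat_c_forward(peer, Ma, Mi):
    """Forwarding policy in threat model C: peers in Ma relay queries only to
    the hidden set Mi, while peers in Mi never propagate queries."""
    if peer in Ma:
        return sorted(Mi)        # forward only to the hidden uploaders
    if peer in Mi:
        return []                # hidden peers swallow the query
    raise ValueError("this policy only applies to malicious peers")

def threat_c_upload(peer, Ma):
    """Upload behavior when chosen as a download source: Ma serves authentic
    files (keeping its local trust clean), Mi serves inauthentic files."""
    return "authentic" if peer in Ma else "inauthentic"

Ma, Mi = {20, 21}, {30, 31, 32}
print(threat_c_forward(20, Ma, Mi))                      # [30, 31, 32]
print(threat_c_upload(20, Ma), threat_c_upload(30, Ma))  # authentic inauthentic
```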

Figure 10.21 Threat model C malicious connections (per cycle, for 3, 5, and 7 hidden peers).

Figure 10.21 shows three simulated sessions using different partitions of the malicious peers into Mi (hidden peers) and Ma, where |Mi| + |Ma| = 10. As expected, the malicious attacks containing more hidden peers have a shorter connection life to peers in P \ {Ma ∪ Mi}. Notice also that in all cases the malicious peers are eventually discovered and disconnected from the network. Therefore, connection trust is sufficient to eliminate malicious peers from the network.

10.6 RELATED WORK

Related topologies have been proposed in [19], [50], and [67]. In [19], a peer connects to peers initially at random, and disconnects when it becomes overloaded. The peers that are disconnected will then connect to other peers. The connections in [19] can be search links (through which search information is sent) or index links (through which index information is sent). Lv et al. present a similar scheme in [50]; the differences here are that there is only one type of link and that each peer tracks its neighbors' capacities and suggests a replacement peer after it breaks a connection. The SLIC mechanism proposed in [67] does not add or break connections, but rather allows each peer to rate its neighbors and use these ratings to control how many queries from each neighbor to process and forward.

More generally, much work has been done on P2P network topologies for efficient search. In particular, distributed hash tables (DHTs) have been introduced [66] that map content to specific nodes, so that queries can be routed directly to the node at which the content is stored.


However, DHTs do not address the issue of robustness to malicious peers, nor do they provide a framework by which more extended personalization can be achieved (e.g., by content recommendation systems). This is the first work that directly addresses personalization as a goal in topology design.

10.7 DISCUSSION

In this chapter, we have presented a simple peer-level protocol that forms personalized P2P networks by giving each user proximity to the content that is most likely to be of interest. The resulting topologies are highly efficient, robust to malicious attacks, and provide built-in incentives and punishments that are consistent with positive peer contribution. As each peer chooses its neighbors, clusters of peers with similar interests and quality of service form. The creation of communities of congenial peers has implications for the further personalization of P2P networks, such as distributed recommendation systems, social-network-based search engines, and social question-answering systems. In conjunction with EigenTrust, these topologies introduce the notions of reputation and personalization into search in P2P networks, much as PageRank and personalized PageRank do for the Web.

Chapter Eleven

Conclusion

As large quantities of data are becoming more accessible via the WWW and P2P file-sharing networks, search is beginning to play a vital role in today's society. As such, it is important to continue to improve the quality of a user's search experience by identifying and intelligently exploiting "hidden" information (or signals) in these databases. One such signal is user context information. This book examined the scalable use of user context information to improve search quality in both the WWW and P2P networks. We have presented scalable algorithms and detailed mathematical analysis for personalized search in both domains.

There are several future challenges for the personalization of search. User interface design is one of the key challenges in personalized search. Some of the key issues here are non-intrusiveness and consistency of experience. It is important not to have personalization degrade the quality of the user's search experience, and likewise, it is important that a user's overall experience maintain consistency even as her interests change. Ideally, personalized search should be as seamless and transparent as unpersonalized search. All of these considerations point to a subtle introduction of personalization into search results.

Another challenge in personalized search is the utilization of implicit measures of user preferences. In the spirit of making the user experience seamless and transparent, a user should not have to make his interests explicit. There are several signals that could be used to indicate a user's preferences, for example, search or browsing history. Correctly using these implicit measures of user preferences remains a fertile area for research in personalization and machine learning.

It would also be interesting and useful to explore approaches to personalized search other than link-based methods. Personalization methods that exploit the text on a page offer much promise, as do methods that exploit user visit behavior where that information is available.

There are broad privacy implications of personalized search in a world where search plays a central role in daily life. My own opinion is that any implementation of personalization where the search engine collects user data should build in transparency (the search engine should show the user all the data it uses to personalize search results), control (the user should be able to delete data that she doesn't want used in personalization), and data portability (any user data collected by the search engine should belong to the user, not the search engine, and the user should be able to download that data and bring it to another search engine). An in-depth discussion on privacy and personalization should be the topic of its own book and is outside the scope of this technical work.


There is immense potential for personalized search not just in ranking, but in many other aspects of the search engine, such as user interfaces, corpora selection in federated search, and different models of "push search," like recommendations and alerts. With Web-enabled mobile phones becoming more common, impoverished input devices will usher in the need for personalization in query formulation and latency reduction. And with the proliferation of social networks, social search will become a more dominant method of search. All of these will require novel approaches, novel algorithms, and novel mathematical underpinnings. Indeed, the scope of future work in this area is broad, and, in the context of today's data-driven society, the potential value is tremendous.

Bibliography

[1] Gnutella website. http://www.gnutella.com.
[2] Karl Aberer and Zoran Despotovic. Managing trust in a peer-to-peer information system. In Proceedings of the 10th International Conference on Information and Knowledge Management (ACM CIKM), 2001.
[3] Advogato's Trust Metric (White Paper). http://www.advogato.org/trust-metric.html.
[4] A. C. Aitken. On Bernoulli's numerical solution of algebraic equations. Proc. Roy. Soc. Edinburgh, 46:289–305, 1926.
[5] Amazon website. www.amazon.com.
[6] Amy Langville and Carl Meyer. Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, 2006.
[7] Arvind Arasu, Jasmine Novak, Andrew Tomkins, and John Tomlin. PageRank computation and the structure of the Web: Experiments and algorithms. In Proceedings of the Eleventh International World Wide Web Conference, Poster Track, 2002.
[8] Tuomas Aura, Pekka Nikander, and Jussipekka Leiwo. DoS-resistant authentication with client puzzles. In Eighth International Workshop on Security Protocols, 2000.
[9] T. Beth, M. Borcherding, and B. Klein. Valuation of trust in open networks. In Proceedings of the Third European Symposium on Research in Computer Security — ESORICS '94, 1994.
[10] Beverly Yang, Patrick Vinograd, and Hector Garcia-Molina. Evaluating GUESS and non-forwarding peer-to-peer search. In ICDCS, 2004.
[11] Krishna Bharat, Bay-Wei Chang, Monika Henzinger, and Matthias Ruhl. Who links to whom: Mining linkage between Web sites. In Proceedings of the IEEE International Conference on Data Mining, November 2001.
[12] Krishna Bharat and Monika R. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In Proceedings of the ACM-SIGIR, 1998.


[13] Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener. Graph structure in the Web. In Proceedings of the Ninth International World Wide Web Conference, 2000.
[14] Captcha Project. http://www.captcha.net.
[15] Soumen Chakrabarti, Byron Dom, David Gibson, Jon Kleinberg, Prabhakar Raghavan, and Sridhar Rajagopalan. Automatic resource compilation by analyzing hyperlink structure and associated text. In Proceedings of the Seventh International World Wide Web Conference, 1998.
[16] Soumen Chakrabarti, Mukul M. Joshi, Kunal Punera, and David M. Pennock. The structure of broad topics on the Web. In Proceedings of the Eleventh International World Wide Web Conference, 2002.
[17] Soumen Chakrabarti, Martin van den Berg, and Byron Dom. Focused crawling: A new approach to topic-specific Web resource discovery. In Proceedings of the Eighth International World Wide Web Conference, 1999.
[18] Junghoo Cho and Hector Garcia-Molina. Parallel crawlers. In Proceedings of the Eleventh International World Wide Web Conference, 2002.
[19] Brian F. Cooper and Hector Garcia-Molina. Ad hoc, self-supervising peer-to-peer search networks. Stanford University Technical Report, 2003.
[20] Fabrizio Cornelli, Ernesto Damiani, Sabrina De Capitani Di Vimercati, Stefano Paraboschi, and Pierangela Samarati. Choosing reputable servents in a P2P network. In Proceedings of the 11th World Wide Web Conference, 2002.
[21] Arturo Crespo and Hector Garcia-Molina. Semantic overlay networks for P2P systems. Stanford University Technical Report, 2002.
[22] Arturo Crespo and Hector Garcia-Molina. Routing indices for P2P systems. In Proceedings of the 28th Conference on Distributed Computing Systems, July 2002.
[23] Neil Daswani and Hector Garcia-Molina. Query-flood DoS attacks in Gnutella. In ACM CCS, 2002.
[24] John R. Douceur. The Sybil attack. In First IPTPS, March 2002.
[25] eBay website. www.ebay.com.
[26] Wolfgang Nejdl et al. EDUTELLA: A P2P networking infrastructure based on RDF. In Proceedings of the Eleventh World Wide Web Conference, May 2002.
[27] Ronald Fagin, Ravi Kumar, and D. Sivakumar. Comparing top k lists. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2003.
[28] Michael J. Freedman and Radek Vingralek. Efficient P2P lookup for a distributed trie. In First International Workshop on P2P Systems, 2002.
[29] Gene H. Golub and Chen Greif. An Arnoldi-type algorithm for computing PageRank. BIT Numerical Mathematics, 46(4):759–771, 2006.


[30] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, 1996.
[31] Geoffrey R. Grimmett and David R. Stirzaker. Probability and Random Processes. Oxford University Press, 1989.
[32] Ramanathan Guha, Ravi Kumar, Prabhakar Raghavan, and Andrew Tomkins. Propagation of trust and distrust. In Proceedings of the Thirteenth International World Wide Web Conference, 2004.
[33] Taher H. Haveliwala. Efficient computation of PageRank. Stanford University Technical Report, 1999.
[34] Taher H. Haveliwala. Topic-sensitive PageRank. In Proceedings of the Eleventh International World Wide Web Conference, 2002.
[35] Taher H. Haveliwala and Sepandar D. Kamvar. The second eigenvalue of the Google matrix. Stanford University Technical Report, 2003.
[36] Jun Hirai, Sriram Raghavan, Hector Garcia-Molina, and Andreas Paepcke. WebBase: A repository of web pages. In Proceedings of the Ninth International World Wide Web Conference, 2000.
[37] Marius Iosifescu. Finite Markov Processes and Their Applications. John Wiley, 1980.
[38] Dean L. Isaacson and Richard W. Madsen. Markov Chains: Theory and Applications, chapter IV, pages 126–127. John Wiley, 1976.
[39] Glen Jeh and Jennifer Widom. Scaling personalized web search. In Proceedings of the Twelfth International World Wide Web Conference, 2003.
[40] Sepandar D. Kamvar, Taher H. Haveliwala, Christopher D. Manning, and Gene H. Golub. Extrapolation methods for accelerating PageRank computations. In Proceedings of the Twelfth International World Wide Web Conference, 2003.
[41] Sepandar D. Kamvar, Mario T. Schlosser, and Hector Garcia-Molina. The EigenTrust algorithm for reputation management in P2P networks. In Proceedings of the Twelfth International World Wide Web Conference, 2003.
[42] Sepandar D. Kamvar, Mario T. Schlosser, and Hector Garcia-Molina. Incentives for combatting freeriding on P2P networks. In Euro-Par 2003, 2003.
[43] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 13 May 1983.
[44] J. Kleinberg, S. R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. The Web as a graph: Measurements, models, and methods. In Proceedings of the International Conference on Combinatorics and Computing, 1999.


[45] Jon Kleinberg. Authoritative sources in a hyperlinked environment. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 1998.
[46] Robert R. Korfhage. Information Storage and Retrieval. John Wiley, 1997.
[47] Samuel Kotz and Norman L. Johnson. Encyclopedia of Statistical Sciences, chapter Kendall's Tau, pages 367–369. John Wiley, 1993.
[48] Udo R. Krieger. Numerical solution of large finite Markov chains by algebraic multigrid techniques. In Proceedings of the 2nd International Workshop on the Numerical Solution of Markov Chains, 1995.
[49] Amy Langville and Carl Meyer. Deeper inside PageRank. Internet Mathematics, 2004.
[50] Qin Lv, Sylvia Ratnasamy, and Scott Shenker. Can heterogeneity make Gnutella scalable? In First International Workshop on P2P Systems, 2002.
[51] D. F. McAllister, G. W. Stewart, and W. J. Stewart. On a Rayleigh-Ritz refinement technique for nearly uncoupled stochastic matrices. Linear Algebra and Its Applications, 60:1–25, 1984.
[52] Alberto Medina, Ibrahim Matta, and John Byers. On the origin of power laws in internet topologies. Technical report, Boston University Computer Science Department, April 2000.
[53] Marina Meila and Jianbo Shi. A random walks view of spectral segmentation. In AI and Statistics (AISTATS), 2001.
[54] Carl D. Meyer. Sensitivity of the stationary distribution of a Markov chain. SIAM Journal on Matrix Analysis and Applications, 15(3):715–728, 1994.
[55] Andrew Y. Ng, Alice X. Zheng, and Michael I. Jordan. Link analysis, eigenvectors and stability. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 903–910, 2001.
[56] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the Web. Stanford Digital Libraries Working Paper, 1998.
[57] Pietro Perona and William T. Freeman. A factorization approach to grouping. In Proceedings of ECCV, 1998.
[58] Pierre-Jacques Courtois. Queueing and Computer System Applications. Academic Press, 1977.
[59] Davood Rafiei and Alberto O. Mendelzon. What is this page known for? Computing Web page reputations. In Proceedings of the Ninth International World Wide Web Conference, 2000.


[60] Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, and Scott Shenker. A scalable content-addressable network. In Proceedings of ACM SIGCOMM, 2001.
[61] Paul Resnick, Richard Zeckhauser, Eric Friedman, and Ko Kuwabara. Reputation systems. Communications of the ACM, 43(12):45–48, 2000.
[62] Matthew Richardson and Pedro Domingos. The intelligent surfer: Probabilistic combination of link and content information in PageRank. In Advances in Neural Information Processing Systems, volume 14. MIT Press, 2002.
[63] Matei Ripeanu and Ian Foster. Mapping the Gnutella network — macroscopic properties of large-scale P2P networks. IEEE Internet Computing Journal, 6(1), 2002.
[64] Stefan Saroiu, P. Krishna Gummadi, and Steven D. Gribble. A measurement study of peer-to-peer file sharing systems. In Proceedings of Multimedia Computing and Networking 2002 (MMCN '02), 2002.
[65] Herbert A. Simon and Albert Ando. Aggregation of variables in dynamic systems. Econometrica, 29:111–138, 1961.
[66] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In Proceedings of the 2001 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, 2001.
[67] Qi Sun and Hector Garcia-Molina. SLIC: A selfish link-based incentive mechanism for unstructured peer-to-peer networks. In Proceedings of the 17th International Conference on Distributed Computing (DISC), 2003.
[68] Lloyd N. Trefethen and David Bau. Numerical Linear Algebra. SIAM, 1997.
[69] VBS.Gnutella Worm. http://securityresponse.symantec.com/avcenter/venc/data/vbs.gnutella.html.
[70] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of "small-world" networks. Nature, 393:440–442, 1998.
[71] Bryce Wilcox-O'Hearn. Experiences deploying a large-scale emergent network. In First International Workshop on P2P Systems, 2002.
[72] J. H. Wilkinson. The Algebraic Eigenvalue Problem. Oxford University Press, 1965.
[73] P. Wynn. On the convergence and stability of the epsilon algorithm. SIAM Journal of Numerical Analysis, 33:91–122, 1966.
[74] Beverly Yang and Hector Garcia-Molina. Improving efficiency of P2P search. In Proceedings of the 28th Conference on Distributed Computing Systems, July 2002.