High-Performance Computing - Parallel, Distributed, and Cache-Conscious Algorithm Design and Analysis, Multithreaded Algorithms, Prefix Sums, Tree Contraction, Work-Efficient Parallel BFS, Graph Separation, Partitioning, Connectivity, Supercomputing


Table of contents:
1 Introduction
2 Breadth-First Search Overview
2.1 Preliminaries
2.2 Parallel BFS: Prior Work
3 Breadth-First Search on Distributed Memory Systems
3.1 BFS with 1D Partitioning
3.2 BFS with 2D Partitioning
4 Implementation Details
4.1 Graph Representation
4.2 Local Computation
4.3 Distributed-memory parallelism
4.4 Load-balancing traversal
5 Algorithm Analysis
5.1 Analysis of the 1D Algorithm
5.2 Analysis of the 2D Algorithm
6 Experimental Studies
7 Conclusions and Future Work
8 References
Introduction
Background on BSP and Pregel
Systems Tested
Giraph
GPS
Mizan
GraphLab
Algorithms
PageRank
SSSP
WCC
DMST
Evaluation Methodology
System Setup
Datasets
Algorithms
Evaluation Metrics
Experimental Results
Summary of Results
Giraph
GPS
Mizan
GraphLab
Results for LJ and OR Datasets
Experiences
Giraph
GPS
Mizan
GraphLab
Conclusion
References


Research Contributions: Algorithms and Data Structures. David Shmoys, Editor

The Input/Output Complexity of Sorting and Related Problems. ALOK AGGARWAL and JEFFREY SCOTT VITTER

ABSTRACT: We provide tight upper and lower bounds, up to a constant factor, for the number of inputs and outputs (I/Os) between internal memory and secondary storage required for five sorting-related problems: sorting, the fast Fourier transform (FFT), permutation networks, permuting, and matrix transposition. The bounds hold both in the worst case and in the average case, and in several situations the constant factors match. Secondary storage is modeled as a magnetic disk capable of transferring P blocks each containing B records in a single time unit; the records in each block must be input from or output to B contiguous locations on the disk. We give two optimal algorithms for the problems, which are variants of merge sorting and distribution sorting. In particular we show for P = 1 that the standard merge sorting algorithm is an optimal external sorting method, up to a constant factor in the number of I/Os. Our sorting algorithms use the same number of I/Os as does the permutation phase of key sorting, except when the internal memory size is extremely small, thus affirming the popular adage that key sorting is not faster. We also give a simpler and more direct derivation of Hong and Kung's lower bound for the FFT for the special case B = P = O(1).

1. INTRODUCTION

The problem of how to sort efficiently has strong practical and theoretical merit and has motivated many studies in the analysis of algorithms and computational complexity. Recent studies [8] confirm that sorting continues to account for roughly one-fourth of all computer cycles. Much of that time is consumed by external sorts, in which the file is too large to fit in internal memory and must reside in secondary storage (typically on magnetic disks). It is well documented that the bottleneck in external sorting is the time for input/output (I/O) between internal memory and secondary storage.


Sorts of extremely large size are becoming more and more common. For example, banks each night typically sort the checks of the current day into increasing order by account number. Then the accounting files can be updated in a single linear pass through the sorted file. In many cases, banks are required to complete this processing before opening for the next business day. Lindstrom and Vitter [8] point out that a typical sort from a few years ago might involve a file of two million records, totaling 800 megabytes, and take about 1½ hours; but in the near future typical files are expected to contain ten million records, totaling 10,000 megabytes, and current sorting methods would take most of one day to do the sorting. (Banks would then have trouble completing this processing before the next business day!) Two alternatives for coping with this problem present themselves. One approach is to relax the problem requirements and to investigate alternate computer architectures such as parallel or distributed systems, as done in [8], for example. The other approach, which we take in this article, is to examine the fundamental limits in terms of the number of I/Os for external sorting and related problems in current computing environments. We assume that there is a single central processing unit, and we model secondary storage as a generalized random-access magnetic disk. (For completeness, we also consider the case in which the disk has some parallel capabilities.) Our parameters are N = # records to sort; M = # records that can fit into internal memory; B = # records that can be transferred in a single block; P = # blocks that can be transferred concurrently;

where 1 ≤ B ≤ M < N and 1 ≤ P ≤ ⌊M/B⌋. We denote the N records by R_1, R_2, ..., R_N. The parameters N, M, and B are referred to as the file size, memory size, and


block size, respectively. Typical parameters for the two sorting examples mentioned earlier are N = 2 × 10^6, M = 2000, B = 100, P = 1, and N = 10^7, M = 3000, B = 50, P = 1.

Each block transfer is allowed to access any contiguous group of B records on the disk. Parallelism is inherent in the problem in two ways. Each block can transfer B records at once, which models the well-known fact that a conventional disk can transfer a block of data via an I/O roughly as fast as it can transfer a single bit. The second factor is that there can be P block transfers at the same time, which partly models special features that the disk might potentially have, such as multiple I/O channels and read/write heads and an ability to access records in noncontiguous locations on disk in a single I/O. Pioneering work in this area was done by Floyd [3], who demonstrated matching upper and lower bounds of Θ((N log N)/B) I/Os for the problem of matrix transposition for the special case P = O(1), B = Θ(M) = Θ(N^c), where c is a constant, 0 < c < 1. Floyd's lower bound for transposition also applied to the problems of permuting and sorting (since they are more general problems), and the bound matched the number of I/Os used by merge sort. For these restricted values of M, B, and P, the bound showed that essentially Ω(log N) passes are needed to sort the file (since each pass takes O(N/B) I/Os), and that merge sorting and the permutation phase of key sorting both perform the optimum number of I/Os. However, for other values of B, M, and P, Floyd's upper and lower bounds did not match, thus leaving open the general question of the I/O complexity of sorting. In this article we present optimal bounds, up to a constant factor, for all values of M, B, and P for the following five sorting-related problems: sorting, fast Fourier transform (FFT), permutation networks, permuting, and matrix transposition. We show that under mild restrictions the constant factors implicit in our upper and lower bounds are often equal. The five problems are similar, but the lower bounds require different techniques, which illustrate precisely the relation of the five problems to one another. The upper bounds can be obtained by a variant of merge sort with P-block lookahead forecasting and by a distribution-sorting algorithm that uses a median-finding subroutine. In particular, we can conclude that the dominant part of sorting, in terms of the number of I/Os, is the rearranging of the records, not determining their order, except when M is extremely small with respect to N. Thus, the permutation phase of key sorting typically requires as many I/Os as does general sorting. Our results answer the pebbling questions posed in [9] concerning the optimum I/O time needed to perform the computation implied by the FFT directed graph (also called the butterfly or shuffle-exchange or Omega network). For lagniappe, we also give a simple direct proof of the lower bound for FFT when B = P = O(1), which was previously proved by Hong and Kung using a complicated pebbling argument.


2. PROBLEM DEFINITIONS

We can picture the internal memory and the secondary storage disk together as extended memory, consisting of a large array containing at least M + N locations, each location capable of storing a single record. We arbitrarily number the M locations in internal memory by x[1], x[2], ..., x[M] and the locations on the disk by x[M + 1], x[M + 2], .... The five problems can be phrased as follows:

Sorting

Problem Instance: The internal memory is empty, and the N records reside at the beginning of the disk; that is, x[i] = nil for 1 ≤ i ≤ M, and x[M + i] = R_i for 1 ≤ i ≤ N.

Goal: The internal memory is empty, and the N records reside at the beginning of the disk in sorted nondecreasing order; that is, x[i] = nil for 1 ≤ i ≤ M, and the records in x[M + 1], x[M + 2], ..., x[M + N] are ordered in nondecreasing order by their key values.

Fast Fourier Transform (FFT)
Problem Instance: Let N be a power of 2. The internal

memory is empty, and the N records reside at the beginning of the disk; that is, x[i] = nil for 1 ≤ i ≤ M, and x[M + i] = R_i for 1 ≤ i ≤ N.
Goal: The N output nodes of the FFT directed graph (digraph) are "pebbled" (to be explained below), and the memory configuration is exactly as in the original problem instance.
The FFT digraph and its underlying recursive construction are shown in Figure 1 for the case N = 16. It consists of log N + 1 columns, each containing N nodes; column 0 contains the N input nodes, and column log N contains the N output nodes.


FIGURE 1. The FFT digraph for N = 16. Column 0 on the left contains the N input nodes, and column log N on the right contains the N output nodes. All edges are directed from left to right. The N-input FFT digraph can be recursively decomposed into two N/2-input FFT digraphs A and B, followed by one extra column of nodes to which the output nodes of A and B are connected in a shuffle-like fashion.


(Unless explicitly specified, the base of the logarithm is 2.) Each non-input node has indegree 2, and each non-output node has outdegree 2. The FFT digraph is also known as the butterfly or shuffle-exchange or Omega network. We shall denote the ith node (0 ≤ i ≤ N − 1) in column j (0 ≤ j ≤ log N) of the FFT digraph by n_{i,j}. The two predecessors of node n_{i,j} are nodes n_{i,j−1} and n_{i⊕2^{j−1},j−1}, where ⊕ denotes the exclusive-or operation. (Note that nodes n_{i,j} and n_{i⊕2^{j−1},j} each have the same two predecessors.) The ith node in each column corresponds to record R_i. We can pebble node n_{i,j} if its two predecessors have already been pebbled and if the records corresponding to those two predecessors both reside in internal memory. Intuitively, the FFT problem can be phrased as the problem of pumping the records into and out of internal memory in a way that permits the computation implied by the FFT digraph.
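To make the predecessor rule concrete, the following Python sketch (an illustration, not part of the paper) generates the edges of the N-input FFT digraph from the formula just stated and checks the indegree and outdegree properties.

```python
import math

def fft_digraph_edges(N):
    """Edges of the N-input FFT (butterfly) digraph.

    Nodes are pairs (i, j) with 0 <= i < N and 0 <= j <= log2(N).
    Each non-input node (i, j) has predecessors (i, j-1) and
    (i ^ 2**(j-1), j-1), exactly as described in the text."""
    logN = int(math.log2(N))
    edges = []
    for j in range(1, logN + 1):
        for i in range(N):
            edges.append(((i, j - 1), (i, j)))
            edges.append(((i ^ (1 << (j - 1)), j - 1), (i, j)))
    return edges

if __name__ == "__main__":
    N = 16
    edges = fft_digraph_edges(N)
    indeg, outdeg = {}, {}
    for u, v in edges:
        indeg[v] = indeg.get(v, 0) + 1
        outdeg[u] = outdeg.get(u, 0) + 1
    # Every non-input node has indegree 2; every non-output node has outdegree 2.
    assert all(d == 2 for d in indeg.values())
    assert all(d == 2 for d in outdeg.values())
    print(len(edges), "edges in the", N, "-input FFT digraph")
```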

Permutation Network
The Problem Instance and Goal are phrased the same as for FFT, except that the permutation network digraph is pebbled rather than the FFT digraph. A complete description of permutation networks appears in [5]. A permutation network digraph consists of J + 1 columns, for some J ≥ log N, each containing N nodes. Column 0 contains the N input nodes, and column J contains the N output nodes. All edges are directed between adjacent columns, in the direction of increasing index. We denote the ith node in column j by n_{i,j}. For each 1 ≤ i ≤ N and 1 ≤ j ≤ J, there is an edge from n_{i,j−1} to n_{i,j}. In addition, n_{i,j} can have one other predecessor, call it n_{i′,j−1}, but when that is the case there is also an edge from n_{i,j−1} to n_{i′,j}; that is, nodes n_{i,j} and n_{i′,j} have the same two predecessors. In that case, we can think of there being a "switch" between nodes n_{i,j} and n_{i′,j} that can be set either to let the data from the previous column pass through unaltered (that is, the data in node n_{i,j−1} goes to n_{i,j} and the data in n_{i′,j−1} goes to n_{i′,j}) or else to swap the data (so the data in n_{i,j−1} goes to n_{i′,j} and the data in n_{i′,j−1} goes to n_{i,j}). A digraph like this is called a permutation network if, for each of the N! permutations p_1, p_2, ..., p_N, we can set the switches in such a way as to realize the permutation; that is, the data at each input node n_{i,0} is routed to output node n_{p_i,J}. The ith node in each column corresponds to the current contents of record R_i, and we can pebble node n_{i,j} if its predecessors have already been pebbled and if the records corresponding to those predecessors reside in internal memory.

Permuting
The Problem Instance and Goal are the same as for sorting,

except that the key values of the N records are required to form a permutation of {1, 2, ..., N}. There is a big difference between permutation networks and general permuting. In the latter case, the particular I/Os performed may depend upon the desired permutation, whereas with permutation networks


all N! permutations can be generated by the same sequence of I/Os. Permuting is the second (and typically dominant) component of key sorting. The first component of key sorting consists of stripping away the key values of the records and sorting the keys. Ideally, the keys are small enough that this sort can be done in internal memory and thus very quickly. In the second component of key sorting, the records are routed to their final positions based upon the ordering determined by the sorting of the keys.

Matrix Transposition
Problem Instance: A p × q matrix A = (A_{i,j}) of N = pq

records stored in row-major order on disk. The internal memory is empty, and the N records reside in row-major order at the beginning of the disk; that is, x[i] = nil for 1 ≤ i ≤ M, and x[M + 1 + i] = A_{1+⌊i/q⌋, 1+i−q⌊i/q⌋} for 0 ≤ i ≤ N − 1.
Goal: The internal memory is empty, and the transposed matrix A^T resides on disk in row-major order. (The q × p matrix A^T is called the transpose of A if A^T_{i,j} = A_{j,i} for all 1 ≤ i ≤ q and 1 ≤ j ≤ p.) An equivalent formulation is for the original matrix A to reside in column-major order on disk.
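For concreteness (an illustration, not part of the paper), the following sketch maps a matrix entry A[i][j] to its row-major disk offset and computes the permutation of offsets that transposition induces.

```python
def row_major_index(i, j, q):
    """0-based disk offset of A[i][j] when the p-by-q matrix A is stored in
    row-major order; i and j are 1-based, as in the problem definition."""
    return (i - 1) * q + (j - 1)

def transpose_permutation(p, q):
    """perm[k] is the offset at which the record stored at offset k of A
    (row-major) must end up so that A^T is stored in row-major order."""
    perm = [0] * (p * q)
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            src = row_major_index(i, j, q)      # position of A[i][j]
            dst = row_major_index(j, i, p)      # position of A^T[j][i]
            perm[src] = dst
    return perm

if __name__ == "__main__":
    # For a 3-by-4 matrix, the record of A[1][2] (offset 1) moves to offset 3.
    print(transpose_permutation(3, 4))
```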

3. THE MAIN RESULTS

Our model requires that each block transfer in an input can move at most B records from disk into internal memory, and that the transferred records must come from a contiguous segment x[M + i], x[M + i + 1], ..., x[M + i + B − 1] of B locations on the disk, for some i ≥ 1; similarly, in each output the transferred records must be deposited within a contiguous segment of B locations. We assume that the records are indivisible; that is, records are transferred in their entirety, and bit manipulations like exclusive-oring are not allowed. Our characterization of the I/O complexity of the five problems is given in the following three main theorems. The constant factors implicit in the bounds are discussed at the end of the section.

THEOREM 3.1. The average-case and worst-case number of I/Os required for sorting N records and for computing the N-input FFT digraph is

Θ((N/(PB)) · log(1 + N/B)/log(1 + M/B)).   (3.1)

For the sorting lower bound, the comparison model is used, but only for the case when M and B are extremely small with respect to N, namely, when B log(1 + M/B) = o(log(1 + N/B)). The average-case and worst-case number of I/Os required for computing any N-input permutation network is

Ω((N/(PB)) · log(1 + N/B)/log(1 + M/B));   (3.2)

furthermore, there are permutation networks such that the number of I/Os needed to compute them is


O((N/(PB)) · log(1 + N/B)/log(1 + M/B)).   (3.3)

THEOREM 3.2. The average-case and worst-case number of I/Os required to permute N records is

Θ(min{N/P, (N/(PB)) · log(1 + N/B)/log(1 + M/B)}).   (3.4)

It is interesting to note that the optimum bound for sorting in Theorem 3.1 matches the second of the two terms being minimized in Theorem 3.2. When the second term in (3.4) achieves the minimum, which happens except when M and B are extremely small with respect to N, the problem of permuting is as hard as the more general problem of sorting; the dominant component of sorting in this case, in terms of the number of I/Os, is the routing of the records, not the determination of their order. When instead M and B are extremely small (namely, when B log(1 + M/B) = o(log(1 + N/B))), the N/P term in (3.4) achieves the minimum, and the optimum algorithm for permuting is to move the records in the naive manner, one record per block transfer. This is precisely the case where advance knowledge of the output permutation makes the problem of permuting easier than sorting. The lower bound for sorting in Theorem 3.1 for this case requires the use of the comparison model. An interesting corollary comes from applying the bound for sorting in Theorem 3.1 to the case M = 2 and B = P = 1, where the number of I/Os corresponds to the number of comparisons needed to sort N records by a comparison-based internal sorting algorithm. Substituting M = 2 and B = P = 1 into (3.1) gives the well-known Θ(N log N) bound.
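The expressions in Theorems 3.1 and 3.2 are easy to evaluate for concrete parameters. The following sketch (illustrative only; the first two parameter settings are the typical ones quoted in Section 2) does so, including the M = 2, B = P = 1 sanity check, which agrees with N log N up to a constant factor.

```python
import math

def sort_bound(N, M, B, P):
    """Expression inside the Theta-bound (3.1)."""
    return (N / (P * B)) * math.log(1 + N / B) / math.log(1 + M / B)

def permute_bound(N, M, B, P):
    """Expression inside the Theta-bound (3.4): min of N/P and the sorting bound."""
    return min(N / P, sort_bound(N, M, B, P))

if __name__ == "__main__":
    for (N, M, B, P) in [(2 * 10**6, 2000, 100, 1), (10**7, 3000, 50, 1)]:
        print(N, M, B, P,
              "sort ~", round(sort_bound(N, M, B, P)),
              "permute ~", round(permute_bound(N, M, B, P)))
    # With M = 2 and B = P = 1, (3.1) is N log N up to a constant factor.
    N = 10**6
    print("M=2, B=P=1:", round(sort_bound(N, 2, 1, 1)),
          "vs N*log2(N) =", round(N * math.log2(N)))
```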

THEOREM 3.3. The number of I/Os required to transpose a p × q matrix stored in row-major order is

Θ((N/(PB)) · log min{M, 1 + min(p, q), 1 + N/B} / log(1 + M/B)).   (3.5)

When B is large, matrix transposition is as hard as general sorting, but for smaller B, the special structure of the transposition permutation makes transposing easier. A good way to regard the expressions in the theorems is in terms of the number of "passes" through the file needed to solve the problem. One "pass" corresponds to the number of I/Os needed to read and write the file once, which is 2N/(PB). A "linear-time" algorithm (defined to be one that requires a constant number of passes through the file) would use O(N/(PB)) I/Os. The logarithmic factors that multiply the N/(PB) term in the above expressions indicate the degree of nonlinearity. The algorithms we use in Section 5 to achieve the upper bounds in the above theorems follow a more restrictive model of I/O, in which all I/Os are "simple" and respect track boundaries.

DEFINITION 3.1. We call an input simple if each record transferred from disk is removed from the disk and deposited into an empty location in internal memory; similarly, an output is simple if the transferred records are removed from internal memory and deposited into empty locations on disk.


DEFINITION 3.2. We denote the kth set (k ≥ 1) of B contiguous locations on the disk, namely, locations x[M + (k − 1)B + 1], x[M + (k − 1)B + 2], ..., x[M + kB], as the kth track.

Each I/O performed by our algorithms transfers exactly B records, corresponding to a complete track. (Some records may be nil if the track is not full.) These assumptions are typically met (or could easily be met) in practical implementations. If we enforce these assumptions and consider the case P = 1, which corresponds to conventional disks, the resulting lower bounds and upper bounds can be made asymptotically tight; that is, the constant factors implicit in the O and Ω bounds in the above theorems are the same. If M = N^c, B = M^d, P = 1, for some constants 0 < c, d < 1, the average-case lower bound for permuting and sorting and the number of I/Os used by merge sort are both asymptotically 2(1 − cd)/(c(1 − d)) · N^{1−cd}, which is a linear function of 2N/B, the number of I/Os per pass. For example, when B = N^{1/4} and log(M/B) = ½ log(N/B), with P = 1, the number of I/Os is asymptotically 4N^{3/4}, which corresponds to two passes over the file to do the sort. If M = N^c, B = M/log M, P = 1, the bounds are each asymptotically 2c(1 − c) N^{1−c} log²N / log log N, and the number of passes is Θ(log N / log log N). When M = √N, B = ½M, P = 1, the worst-case upper and lower bounds are asymptotically 2√N log N, which corresponds to ½ log N passes. In the above three examples, if B = Ω(√N / log^b N), for some b, the permutation corresponding to the transposition of a B × N/B matrix is a worst-case permutation for the permuting and sorting problems. The restrictions adhered to by our algorithms allow our upper bounds to apply to the pebbling-based model of I/O defined by Savage and Vitter [9]. Our results answer some of the open questions posed there for sorting and FFT by providing tight upper and lower bounds. The model in [9] corresponds to our model for P = 1 with the restriction that only records that were output together in a single block can be input together in a single block.
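The closed form above is easy to check numerically. The following sketch (illustrative only; the parameter choices are arbitrary) evaluates both 2N log(N/B)/(B log(M/B)) and 2(1 − cd)/(c(1 − d)) · N^{1−cd} for M = N^c and B = M^d, and reports the implied number of passes.

```python
import math

def merge_sort_ios(N, c, d):
    """I/O count 2N log(N/B)/(B log(M/B)) for M = N**c, B = M**d, P = 1, together
    with the closed form 2(1-cd)/(c(1-d)) * N**(1-cd) quoted in the text, and the
    corresponding number of passes (1-cd)/(c(1-d))."""
    M = N ** c
    B = M ** d
    direct = 2 * N * math.log(N / B) / (B * math.log(M / B))
    closed = 2 * (1 - c * d) / (c * (1 - d)) * N ** (1 - c * d)
    passes = (1 - c * d) / (c * (1 - d))
    return direct, closed, passes

if __name__ == "__main__":
    for c, d in [(0.5, 0.5), (0.75, 0.5)]:
        direct, closed, passes = merge_sort_ios(10**9, c, d)
        print(f"c={c}, d={d}: {direct:.3e} I/Os = {closed:.3e}, {passes:.2f} passes")
```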

4. PROOF OF THE LOWER BOUNDS

Without loss of generality, we assume that B, M, and N are powers of 2 and that B < M < N. We shall consider the case P = 1 when there is only one I/O at a time; the general lower bound will follow by dividing the bound we obtain by P. For the average-case analysis of permuting and sorting, we assume that all N! inputs are equally likely. The FFT, permutation network, and matrix transposition problems have no input distribution, so the average-case and worst-case models are the same.


Permuting
First we prove a useful lemma, which applies not only to permuting but also to the other problems. It allows us to assume, for purposes of obtaining the lower bound, that I/Os are simple (see Definition 3.1) and thus that exactly one copy of each record is present throughout the execution of the algorithm.

LEMMA 4.1. For each computation that implements a permutation of the N records R_1, R_2, ..., R_N (or that sorts, or that transposes, or that computes the FFT digraph or a permutation network), there is a corresponding computation strategy involving only simple I/Os such that the total number of I/Os is no greater.

PROOF. It is easy to construct the simple computation strategy by working backwards. We cancel the transfer of a record if its transfer is not needed for the final result. The resulting I/O strategy is simple.
Our approach is to bound the number of possible permutations that can be generated by t I/Os. If we take the value of t for which the bound reaches N!, we get a lower bound on the worst-case number of I/Os. We can get a lower bound on the average case in a similar way.

DEFINITION 4.1. We say that a permutation p_1, p_2, ..., p_N of the N records can be generated at time t if there is some sequence of t I/Os such that, after the I/Os, the records appear in the correct permuted order in extended memory; that is, for all i, j, and k, if x[i] = R_{p_k} and x[j] = R_{p_{k+1}}, then i < j.

The records do not have to be in contiguous positions in internal memory or on disk; there can be arbitrarily many empty locations between R_{p_k} and R_{p_{k+1}}. As mentioned above, we assume that I/Os are simple. We also make the following assumption, which does not increase the number of I/Os by more than a small constant factor. We require that each input and output transfer exactly B records, some of the records being possibly nil, and that the B records come from or go to a single track. For example, an input of b < B records, with b_1 records from one track and b_2 = b − b_1 records from the next track, can be simulated using an internal memory of size M + B by an input of the first track, an output of the B − b_1 records that are not needed (plus an additional b_1 nil records to take the place of the b_1 desired records), and then a corresponding input and output for the next track. As a consequence, since I/Os are simple, a track immediately after an input or immediately before an output must be empty. We do not count internal computation time in our complexity model, so we can assume that the optimum algorithm, between I/Os, rearranges the records in internal memory however it sees fit. Initially, the number of permutations generated is 1. Let us consider the effect of an I/O. There can be at most N/B + t − 1 full tracks before the tth output, and the records in the tth output can go into one of at most


N/B + t places relative to the full tracks. Hence, the tth output changes the number of permutations generated by at most a multiplicative factor of N/B + t, which can be bounded trivially by N(1 + log N). For the case of input, we first consider an input of B records from a specific track on disk. If the B records were output together during some previous output, then by our assumptions this implies that at some earlier time they were together in internal memory and were arranged in an arbitrary order by the algorithm. Thus, the B! possible orders of the B inputted records could already have been generated before the input took place. This implies in a subtle way that the increase in the number of permutations generated due to rearrangement in internal memory is at most a multiplicative factor of (M choose B), which is the number of ways to intersperse B indistinguishable items within a group of size M. If the B records were not output together previously, then the number of permutations generated is increased by an extra factor of B!, since the B records have not yet been permuted arbitrarily. It is important to note that this extra factor of B! can appear only N/B times, namely once when the kth track is inputted for the first time, for each 1 ≤ k ≤ N/B. The above analysis applies to input from a specific track. If the input is the tth I/O, there are at most N/B + t − 1 tracks to choose from for the I/O, plus one more because input from an empty track is also possible. Putting our results together, we find that the number of permutations generated at time t can be a multiplicative factor of at most

(N/B + t) · B! · (M choose B) ≤ N(1 + log N) · B! · (M choose B)   (4.1)

times greater than the number of permutations generated at time t − 1, if the tth I/O is the input of the kth track for the first time, for some 1 ≤ k ≤ N/B. Otherwise, the multiplicative factor is bounded by

(N/B + t) · (M choose B) ≤ N(1 + log N) · (M choose B).   (4.2)

For the worst case, we get our lower bound by using (4.1) and (4.2) to determine the minimum value T such that the number of permutations generated is at least N!:

(B!)^{N/B} · (N(1 + log N) · (M choose B))^T ≥ N!.   (4.3)

The (B!)^{N/B} term appears because (4.1) contributes an extra B! factor over (4.2), but this can happen at most N/B times. Taking logarithms and applying Stirling's formula to (4.3), with some algebraic manipulation, we get

T(log N + B log(M/B)) = Ω(N log(N/B)).   (4.4)

If B log(M/B) ≤ log N, then it follows that B ≤ √N, and from (4.4) we get

T = Ω(N log(N/B)/log N) = Ω(N).   (4.5)

On the other hand, if log N < B log(M/B), then (4.4) gives us

T = Ω(N log(N/B)/(B log(M/B))).   (4.6)

Combining (4.5) and (4.6), we get

T = Ω(min{N, N log(N/B)/(B log(M/B))}).   (4.7)
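For a rough numerical sense of how (4.3) and (4.7) behave, the following sketch (not part of the paper; the parameter values are arbitrary) computes, via log-factorials, the smallest T allowed by the counting argument and compares it with N log(N/B)/(B log(M/B)).

```python
import math

def log_binom(M, B):
    """Natural logarithm of the binomial coefficient (M choose B)."""
    return math.lgamma(M + 1) - math.lgamma(B + 1) - math.lgamma(M - B + 1)

def min_T_from_counting(N, M, B):
    """Smallest T satisfying (4.3), worked entirely in logarithms:
    (B!)**(N/B) * (N*(1 + log N) * (M choose B))**T >= N!"""
    log_target = math.lgamma(N + 1)                       # log N!
    log_budget = (N / B) * math.lgamma(B + 1)             # log (B!)**(N/B)
    log_factor = math.log(N * (1 + math.log2(N))) + log_binom(M, B)
    return max(0.0, (log_target - log_budget) / log_factor)

if __name__ == "__main__":
    N, M, B = 10**6, 10**4, 10**2
    T = min_T_from_counting(N, M, B)
    closed = N * math.log(N / B) / (B * math.log(M / B))
    print(f"counting bound: T >= {T:.3e};  N log(N/B)/(B log(M/B)) = {closed:.3e}")
```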

We get the worst-case lower bound in Theorem 3.2 by dividing (4.7) by P. For the average case, in which the N! permutations are equally likely, we can bound the average running time by the minimum value T such that

(B!)^{N/B} · (N(1 + log N) · (M choose B))^T ≥ N!/2   (4.8)

(cf. (4.3)). At least half of the permutations require at least T I/Os; hence the average time to permute is at least ½T. The lower bound for T follows by the same steps we used to handle (4.3). Note that this lower bound for T is roughly a factor of ½ times the bound on T in the worst case that follows from (4.3), but it is straightforward to derive (using (4.3) and a more careful estimate of the expected value) an average-case bound that is asymptotically the same as the worst-case bound. Our proof technique also provides the constant factors implicit in the bounds in Theorem 3.2. If we assume that I/Os are simple and respect track boundaries, then the upper and lower bounds are asymptotically exact in many cases, as mentioned at the end of Section 3. If B = o(M) and P = 1, and if log(M/B) either divides log(N/B) or else is o(log(N/B)), then the average number of I/Os for permuting (and sorting) is asymptotically at least 2N log(N/B)/(B log(M/B)), which is matched by merge sort. The proof of the lower bound follows from the observation that there must be as many outputs as inputs, coupled with a more careful analysis of (4.3). For the last case quoted in Section 3, the matching lower bound follows by an analysis of matrix transposition, which we do later in this section.

FFT and Permutation Networks
A key observation for obtaining the lower bound for the FFT is that we can construct a permutation network by stacking together three FFT digraphs, so that the output nodes of one FFT are the input nodes for the next [10]. Thus the FFT and permutation network problems are essentially equivalent, since, as we shall see, the lower bound for permutation networks matches the upper bound for FFT. Let us consider an optimal I/O strategy for a permutation network. The second key observation is that the I/O sequence is fixed. This allows us to apply the lower-bound proof developed above for permuting, with the helpful restriction that each I/O cannot depend upon the desired permutation; that is, regardless of the permutation, the records that are transferred during an I/O and the track accessed during the I/O are fixed for each I/O. This eliminates the N/B + t terms in (4.1) and (4.2). Each output can at most double the number of permutations generated. The lower bound on the number of I/Os follows for P = 1 by finding the smallest T such that


(B!)^{N/B} · (2 · (M choose B))^T ≥ N!.   (4.9)

By using Stirling's formula, we get the same bound as in (4.6). Dividing by P gives the lower bound in Theorem 3.1. It is interesting to note that since the I/O sequence is fixed and cannot depend upon the particular permutation, we are not permitted to use the naive method of permuting, in which each block transfer moves one record from its initial position to its final destination. This is reflected in the growth rate of the number of permutations generated due to a single I/O: the N/B + t term in the growth rate in (4.1) and (4.2) for permuting, which is dominant when the naive method is optimal, does not appear in the corresponding growth rate for permutation networks.

Sorting
Permuting is a special case of sorting, so the lower bound for permuting in Theorem 3.2 also applies to sorting. However, when B log(M/B) = o(log(N/B)), the lower bound becomes Ω(N/P), which is not good enough. In this case, the specific knowledge of what goes where makes generating a permutation easier than sorting. We can get a better lower bound for sorting for the B log(M/B) = o(log(N/B)) case by using an adversary argument, if we restrict ourselves to the comparison model of computation. Without loss of generality, we can make the following additional assumptions, similar to the ones earlier: all I/Os are simple, and each I/O transfers B records, some possibly nil, to or from a single track on disk. We also assume that between I/Os the optimal algorithm performs all possible comparisons among the records in internal memory. Let us consider an input of B records into internal memory. If the B records were previously outputted together during an earlier output, then by our assumptions all comparisons were performed among the B records when they were together in internal memory, and their relative ordering is known. The records in internal memory before the input, which number at most M − B, have also had all possible comparisons performed. Thus, after the input, there are at most (M choose B) sets of possible outcomes of the comparisons between the records in memory. If the B records were not previously outputted together (that is, if the input is the first input of the kth track, for some 1 ≤ k ≤ N/B), then there are at most B!(M choose B) sets of possible outcomes of the compari-


sons. The adversary chooses the outcome that maximizes the number of total orders consistent with the comparisons done so far. It follows that (4.9) holds at time T, which yields the desired lower bound. Dividing by P gives the lower bound stated in Theorem 3.1. The same result holds in the average-case model. We consider the comparison tree with N! leaves, representing the N! total orderings. Each node in the tree represents an input operation. The nodes are constrained to have degree bounded by (M choose B), except that each node corresponding to the input of one of tracks 1, ..., N/B can have degree at most B!(M choose B); there can be at most N/B such high-degree nodes along any path from the root to a leaf. The external path length divided by N!, minimized over all possible computation trees, gives the desired lower bound for P = 1. Dividing by P gives the lower bound of Theorem 3.1.

Matrix Transposition
We prove the lower bound using a potential function argument similar to the one used by Floyd [3]. It suffices to consider the case P = 1; the general lower bound will follow by dividing by P. Without loss of generality, we assume that p and q are powers of 2, and that all I/Os are simple and transfer exactly B records, some possibly nil. We define the ith target group, for 1 ≤ i ≤ N/B, to be the set of records that will ultimately be in the ith track at the termination of the algorithm. We define the continuous function

f(x) = x log x, if x > 0;  f(x) = 0, if x = 0.   (4.10)

POT(0) = 0, if B ≤ min(p, q);
POT(0) = N log(B/min(p, q)), if min(p, q) < B ≤ max(p, q);
POT(0) = N log(B²/(pq)), if B > max(p, q).

At the end of the algorithm, there are at least T/2 outputs, and thus by (4.20)

T = Ω((POT(T) − POT(0)) / (B log(M/B))).   (4.21)

The lower bound

T = Ω((N/B) · log min{M, 1 + min(p, q), 1 + N/B} / log(1 + M/B))   (4.22)

follows by substituting (4.14) and the different cases of (4.15) into (4.21). The general lower bound in Theorem 3.3 for P > 1 follows by dividing (4.22) by P. The constant factor implicit in the above analysis matches the constant factor 2 for merge sort in several cases, if we require that all I/Os be simple and respect track boundaries, as defined at the end of Section 3. When P = 1 and B = Ω(√N/log^b N), for some b, and when log(M/B) either divides log(N/B) or else is o(log(N/B)), the lower bound for transposing a B × N/B matrix is asymptotically at least 2N log(N/B)/(B log(M/B)), which matches the performance of merge sort. The proof follows from the above analysis (which gives a lower bound on the number of inputs required) and the observation that there must be as many outputs as inputs. If, in addition, we substitute the bound f(x + y) − f(x) − f(y) ≤ x + y, for positive integers x and y, in place of (4.19), we get the same asymptotic lower bound formula for the case M = √N, B = ½M, P = 1, which matches merge sort.

5. OPTIMAL ALGORITHMS

In this section, we describe variants of merge sort and distribution sort that achieve the bounds in Theorems 3.1-3.3. As mentioned in Section 3, the algorithms follow the added restriction that records input in the same block must have been output previously in a single block, except for the first input of each track. It suffices to consider worst-case complexity, since the average-case result follows immediately. We first discuss the sorting problem and then apply our results to get optimum algorithms for permuting, FFT, permutation networks, and matrix transposition. Without loss of generality, we can assume that B, M, and N are powers of 2. In each pass of the merging phase, M/B − 1 runs are merged into one longer run. During the processing, one block from each of the runs being merged resides in internal memory. When the records of a block expire, the next track for that run is input. The resulting number of I/Os is


O((N/B′) · log(1 + N/B)/log(1 + M/B)).   (5.2)

Dividing (5.2) by P′ = PB/B′ and with some algebraic manipulation, we get the desired upper bound stated in Theorem 3.1.
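The merging phase just analyzed is easy to simulate. The following toy sketch (an illustration under simplified assumptions, not the paper's exact algorithm: P = 1, no lookahead forecasting, runs merged M/B − 1 at a time) counts block transfers for an in-memory model of external merge sort; each "pass" costs about 2N/B transfers, as in the discussion of Section 3.

```python
import heapq
import random

def external_merge_sort_io_count(records, M, B):
    """Form sorted runs of M records, then repeatedly merge M//B - 1 runs at a
    time, keeping one B-record block per input run in 'memory'.  Returns the
    sorted output and the number of B-record block reads plus writes."""
    assert M >= 2 * B
    ios = 0

    def blocks(n):                        # number of B-record blocks in n records
        return -(-n // B)

    # Run-formation pass: the whole file is read and written once.
    runs = []
    for start in range(0, len(records), M):
        run = sorted(records[start:start + M])
        runs.append(run)
        ios += 2 * blocks(len(run))

    fan_in = M // B - 1                   # one block of each input run fits in memory
    while len(runs) > 1:
        next_runs = []
        for start in range(0, len(runs), fan_in):
            merged = list(heapq.merge(*runs[start:start + fan_in]))
            next_runs.append(merged)
            ios += 2 * blocks(len(merged))    # every block read once, written once
        runs = next_runs
    return runs[0], ios

if __name__ == "__main__":
    data = [random.random() for _ in range(4096)]
    out, ios = external_merge_sort_io_count(data, M=64, B=8)
    assert out == sorted(data)
    print("block I/Os:", ios, " with 2N/B =", 2 * len(data) // 8, "per pass")
```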

Distribution Sort

For simplicity, we assume that M/B is a perfect square, and we use S to denote the quantity √(M/B). The main idea of the algorithm is that with O(N/(PB)) I/Os we can find S approximate partitioning elements b_1, b_2, ..., b_S that break up the file into roughly equal-sized "buckets." (For completeness, we define the dummy partitioning elements b_0 = −∞ and b_{S+1} = +∞.) More precisely, we shall prove later, for 1 ≤ i ≤ S + 1, that the number of records whose key value is ≤ b_i is between


(i − ¼)N/S and (i + ¼)N/S. Hence, the number N_i of records in the ith bucket (that is, the number of records whose key value K is in the range b_{i−1} < K ≤ b_i) satisfies

N/(2S) ≤ N_i ≤ 3N/(2S).   (5.3)

For the time being, we assume that we can compute the approximate partitioning elements using O(N/(PB)) I/Os. Then with O(M/(PB)) additional I/Os we can input M records from disk into internal memory and partition them into the S bucket ranges. The records in each bucket range can be stored on disk in contiguous groups of B records each (except possibly for the last group) with a total of O(M/(PB) + S/P) = O(M/(PB)) I/Os. This procedure is repeated for another N/M − 1 stages, in order to partition all N records into buckets. The ith bucket will thus consist of G_i ≤ N_i/B + N/M = O(N_i/B) groups of at most B contiguous records, by using inequality (5.3). The buckets are totally ordered with respect to one another. The remainder of the algorithm consists of recursively sorting the buckets one by one and appending the results to disk. The number of I/Os needed to input the contents of the ith bucket into internal memory during the recursive sorting is bounded by G_i/P = O(N_i/(PB)). Let us define T(n) to be the number of I/Os used to sort n records. The above construction gives us

T(N) = Σ_{1 ≤ i ≤ S+1} T(N_i) + O(N/(PB)).   (5.4)

Using the facts that N_i = O(N/S) = O(N/√(M/B)) and T(M) = O(M/(PB)), we get the desired upper bound given in Theorem 3.1. All that remains to show is how to get the S approximate partitioning elements via O(N/(PB)) I/Os. Our procedure for computing the approximate partitioning elements must work for the recursive step of the algorithm, so we assume that the N records are stored in O(N/B) groups of contiguous records, each of size at most B. First we describe a subroutine that uses O(n/(PB)) I/Os to find the record with the kth smallest key (or simply the kth smallest record) in a set containing n records, in which the records are stored on disk in at most O(n/B) groups, each group consisting of at most B contiguous records. We load the n records into memory, one memoryload at a time, and sort each of the ⌈n/M⌉ memoryloads internally. We pick the median record from each of these sorted sets and find the median of the medians using the linear-time sequential algorithm developed in [2]. The number of I/Os required for these operations is O(n/(PB) + (n/B)/P + n/M) = O(n/(PB)). We use the key value of this median record to partition the n records into two sets. It is easy to verify that each set can be partitioned into groups of size B (except possibly for the last group) in which each group is stored contiguously on disk. It is also easy to see that each of the two sets has size bounded by 3n/4. The algorithm is recursively applied to the appropriate half to find the kth smallest record; the total number of I/Os is O(n/(PB)).
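The selection subroutine just described can be sketched in a few lines. The following Python illustration (not from the paper) keeps everything in memory and replaces the linear-time median-of-medians algorithm of [2] with a plain sort of the small list of medians, which is enough to see why each recursive call shrinks the problem to at most roughly 3n/4 records.

```python
import random

def kth_smallest(keys, k, M):
    """Find the kth smallest key (k is 1-based): split the keys into memoryloads
    of M, sort each, take the median of each memoryload, pivot on the median of
    those medians, and recurse on the side that contains the kth smallest key."""
    if len(keys) <= M:
        return sorted(keys)[k - 1]
    medians = []
    for start in range(0, len(keys), M):
        load = sorted(keys[start:start + M])
        medians.append(load[len(load) // 2])
    pivot = sorted(medians)[len(medians) // 2]   # stand-in for median-of-medians
    low = [x for x in keys if x < pivot]
    high = [x for x in keys if x > pivot]
    n_low, n_eq = len(low), len(keys) - len(low) - len(high)
    if k <= n_low:
        return kth_smallest(low, k, M)
    if k <= n_low + n_eq:
        return pivot
    return kth_smallest(high, k - n_low - n_eq, M)

if __name__ == "__main__":
    data = [random.randrange(10**6) for _ in range(5000)]
    assert kth_smallest(data, 1234, M=256) == sorted(data)[1233]
```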

We now describe how to apply this subroutine to find the S approximate partitioning elements in a set containing N records. As above, we start out by sorting N/M memoryloads of records, which can be done with O(N/(PB) + (N/B)/P) = O(N/(PB)) I/Os. Let us denote the jth sorted set by U_j. We construct a new set U′ of size at most 4N/S consisting of the (¼kS)th records (in sorted order) of U_j, for 1 ≤ k ≤ 4M/S − 1 and 1 ≤ j ≤ N/M. Each memoryload of M records contributes 4M/S > B records to U′, so these records can be output one block at a time. The total number of contiguous groups of records comprising U′ is O(|U′|/B), so we can apply the subroutine above to find the record of rank 4iN/S² in U′ with only O(|U′|/(PB)) = O(N/(S·PB)) I/Os; we call its key value b_i. The S b_i's can thus be found with a total of O(N/(PB)) I/Os. It is easy to show that the b_i's satisfy the conditions for being approximate partitioning elements, thus completing the proof.

Permuting
The permuting problem is a special case of the sorting problem, and thus can be solved by using a sorting algorithm. To get the upper bound of Theorem 3.2, we use either of the sorting algorithms described above, unless B log(M/B) = o(log(N/B)), in which case it is faster to move the records one by one in the naive manner to their final positions, using O(N/P) I/Os.


FFT and Permutation Networks

As mentioned in Section 4, three FFT digraphs concatenated together form a permutation network, so it suffices to consider optimum algorithms for FFT. For simplicity, we assume that log M divides log N. The FFT digraph can be decomposed into (log N)/log M stages, as pictured in Figure 2. Stage k, for 1 ≤ k ≤ (log N)/log M, corresponds to the pebbling of columns (k − 1)log M + 1, (k − 1)log M + 2, ..., k log M of the FFT digraph. The M nodes in column (k − 1)log M that share common ancestors in column k log M are processed together in a phase. The corresponding M records are brought into internal memory via a transposition permutation, and then the next log M columns can be pebbled. The I/O requirement for each stage is thus due to the transpositions needed to rearrange the records into the proper groups of size M. The transpositions can be collectively done via a simple merging procedure described in the next subsection, which requires a total of O((N/(PB)) log_{M/B} min(M, N/M)) I/Os. There are (log N)/log M stages, making the total number of I/Os

O(((log N)/log M) · (N/(PB)) · log_{M/B} min(M, N/M)),

which can be shown by some algebraic manipulation to equal the upper bound of Theorem 3.1.




FIGURE 2. Decomposition of the FFT digraph into stages, for N = 8, M = 2.

Matrix Transposition

Without loss of generality, we assume that p and q are powers of 2. Matrix transposition is a special case of permuting. The intuition gained from the lower-bound proof in Section 4 can be used to develop a simple algorithm for achieving the upper bound in Theorem 3.3. In each track, the B records are partitioned into different target groups; each group in the decomposition is called a target subgroup. Before the start of the algorithm, the size of each target subgroup is (cf. (4.15))

x = 1, if B ≤ min(p, q); x = B/min(p, q), if min(p, q) < B ≤ max(p, q); x = B²/(pq), if B > max(p, q).   (5.6)

The target subgroups are repeatedly merged, in the manner of merge sort, until each target group resides in its own track; the number of passes required is thus

log(B/x)/log(M/B).   (5.7)

We get the upper bound in Theorem 3.3 by substituting the values of x from (5.6) into (5.7) and by multiplying by 2N/(PB), the number of I/Os per pass.

6. ALTERNATE RESULT


Chapter 27 Multithreaded Algorithms

The running time of FIB(n) is T(n) = T(n − 1) + T(n − 2) + Θ(1), which we can analyze by the substitution method with the inductive hypothesis T(n) ≤ aF_n − b, where a > 1 and b > 0 are constants. Substituting, we obtain

T(n) ≤ (aF_{n−1} − b) + (aF_{n−2} − b) + Θ(1)
     = a(F_{n−1} + F_{n−2}) − 2b + Θ(1)
     = aF_n − b − (b − Θ(1))
     ≤ aF_n − b,

if we choose b large enough to dominate the constant in the Θ(1). We can then choose a large enough to satisfy the initial condition. The analytical bound

T(n) = Θ(φ^n),   (27.1)

where φ = (1 + √5)/2 is the golden ratio, now follows from equation (3.25). Since F_n grows exponentially in n, this procedure is a particularly slow way to compute Fibonacci numbers. (See Problem 31-3 for much faster ways.)
Although the FIB procedure is a poor way to compute Fibonacci numbers, it makes a good example for illustrating key concepts in the analysis of multithreaded algorithms. Observe that within FIB(n), the two recursive calls in lines 3 and 4 to FIB(n − 1) and FIB(n − 2), respectively, are independent of each other: they could be called in either order, and the computation performed by one in no way affects the other. Therefore, the two recursive calls can run in parallel. We augment our pseudocode to indicate parallelism by adding the concurrency keywords spawn and sync. Here is how we can rewrite the FIB procedure to use dynamic multithreading:

P-FIB(n)
1  if n ≤ 1
2        return n
3  else x = spawn P-FIB(n − 1)
4        y = P-FIB(n − 2)
5        sync
6        return x + y
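As a rough illustration (not from the text) of how spawn and sync map onto a real concurrency platform, the following Python sketch expresses the same structure with concurrent.futures: the executor plays the role of the scheduler, the submitted future is the spawned child, and waiting on its result is the sync. Spawning is cut off below a small depth so the worker pool cannot be exhausted, and because of CPython's global interpreter lock the sketch illustrates the logical parallelism of P-FIB rather than actual speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def fib(n):                              # the serialization: plain serial FIB
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)

def p_fib(n, pool, depth=0, max_depth=3):
    """Spawn P-FIB(n-1) as a task, compute P-FIB(n-2) in the parent (line 4),
    then sync by waiting on the spawned task's result."""
    if n <= 1:
        return n
    if depth >= max_depth:               # coarsening: stop spawning near the leaves
        return fib(n)
    x_future = pool.submit(p_fib, n - 1, pool, depth + 1, max_depth)   # spawn
    y = p_fib(n - 2, pool, depth + 1, max_depth)
    x = x_future.result()                                              # sync
    return x + y

if __name__ == "__main__":
    # At most 1 + 2 + 4 = 7 tasks are ever spawned, so 8 workers suffice.
    with ThreadPoolExecutor(max_workers=8) as pool:
        print(p_fib(20, pool))           # 6765
```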

Notice that if we delete the concurrency keywords spawn and sync from P-FIB, the resulting pseudocode text is identical to FIB (other than renaming the procedure in the header and in the two recursive calls). We define the serialization of a multithreaded algorithm to be the serial algorithm that results from deleting the multithreaded keywords: spawn, sync, and, when we examine parallel loops, parallel. Indeed, our multithreaded pseudocode has the nice property that a serialization is always ordinary serial pseudocode to solve the same problem. Nested parallelism occurs when the keyword spawn precedes a procedure call, as in line 3. The semantics of a spawn differs from an ordinary procedure call in that the procedure instance that executes the spawn (the parent) may continue to execute in parallel with the spawned subroutine (its child), instead of waiting


for the child to complete, as would normally happen in a serial execution. In this case, while the spawned child is computing P-FIB(n − 1), the parent may go on to compute P-FIB(n − 2) in line 4 in parallel with the spawned child. Since the P-FIB procedure is recursive, these two subroutine calls themselves create nested parallelism, as do their children, thereby creating a potentially vast tree of subcomputations, all executing in parallel. The keyword spawn does not say, however, that a procedure must execute concurrently with its spawned children, only that it may. The concurrency keywords express the logical parallelism of the computation, indicating which parts of the computation may proceed in parallel. At runtime, it is up to a scheduler to determine which subcomputations actually run concurrently by assigning them to available processors as the computation unfolds. We shall discuss the theory behind schedulers shortly. A procedure cannot safely use the values returned by its spawned children until after it executes a sync statement, as in line 5. The keyword sync indicates that the procedure must wait as necessary for all its spawned children to complete before proceeding to the statement after the sync. In the P-FIB procedure, a sync is required before the return statement in line 6 to avoid the anomaly that would occur if x and y were summed before x was computed. In addition to the explicit synchronization provided by the sync statement, every procedure executes a sync implicitly before it returns, thus ensuring that all its children terminate before it does.

A model for multithreaded execution

It helps to think of a multithreaded computation (the set of runtime instructions executed by a processor on behalf of a multithreaded program) as a directed acyclic graph G = (V, E), called a computation dag. As an example, Figure 27.2 shows the computation dag that results from computing P-FIB(4). Conceptually, the vertices in V are instructions, and the edges in E represent dependencies between instructions, where (u, v) ∈ E means that instruction u must execute before instruction v. For convenience, however, if a chain of instructions contains no parallel control (no spawn, sync, or return from a spawn, whether via an explicit return statement or the return that happens implicitly upon reaching the end of a procedure), we may group them into a single strand, each of which represents one or more instructions. Instructions involving parallel control are not included in strands, but are represented in the structure of the dag. For example, if a strand has two successors, one of them must have been spawned, and a strand with multiple predecessors indicates that the predecessors joined because of a sync statement. Thus, in the general case, the set V forms the set of strands, and the set E of directed edges represents dependencies between strands induced by parallel control.



Figure 27.2 A directed acyclic graph representing the computation of P-FIB(4). Each circle represents one strand, with black circles representing either base cases or the part of the procedure (instance) up to the spawn of P-FIB(n − 1) in line 3, shaded circles representing the part of the procedure that calls P-FIB(n − 2) in line 4 up to the sync in line 5, where it suspends until the spawn of P-FIB(n − 1) returns, and white circles representing the part of the procedure after the sync, where it sums x and y, up to the point where it returns the result. Each group of strands belonging to the same procedure is surrounded by a rounded rectangle, lightly shaded for spawned procedures and heavily shaded for called procedures. Spawn edges and call edges point downward, continuation edges point horizontally to the right, and return edges point upward. Assuming that each strand takes unit time, the work equals 17 time units, since there are 17 strands, and the span is 8 time units, since the critical path (shown with shaded edges) contains 8 strands.
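Using the strand decomposition described in the caption (one strand for a base case, three strands per instance with n ≥ 2, joined by spawn, call, continuation, and return edges), the work and span of P-FIB(n) obey simple recurrences. The sketch below (an illustration, not from the text) evaluates them and reproduces the 17-strand work and 8-strand span of Figure 27.2.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def work(n):
    """Number of unit-time strands in P-FIB(n): 1 for a base case,
    3 (black, shaded, white) plus the strands of both children otherwise."""
    if n <= 1:
        return 1
    return 3 + work(n - 1) + work(n - 2)

@lru_cache(maxsize=None)
def span(n):
    """Strands on the longest path through P-FIB(n): either black -> spawned
    P-FIB(n-1) -> white (span(n-1) + 2) or black -> shaded -> called
    P-FIB(n-2) -> white (span(n-2) + 3)."""
    if n <= 1:
        return 1
    return max(2 + span(n - 1), 3 + span(n - 2))

if __name__ == "__main__":
    print(work(4), span(4))          # 17 and 8, matching Figure 27.2
    for n in (10, 20, 30):
        # The work/span ratio (the parallelism, defined below) grows quickly.
        print(n, work(n), span(n), round(work(n) / span(n), 1))
```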

If G has a directed path from strand u to strand v, we say that the two strands are (logically) in series. Otherwise, strands u and v are (logically) in parallel. We can picture a multithreaded computation as a dag of strands embedded in a tree of procedure instances. For example, Figure 27.1 shows the tree of procedure instances for P-FIB(6) without the detailed structure showing strands. Figure 27.2 zooms in on a section of that tree, showing the strands that constitute each procedure. All directed edges connecting strands run either within a procedure or along undirected edges in the procedure tree. We can classify the edges of a computation dag to indicate the kind of dependencies between the various strands. A continuation edge (u, u′), drawn horizontally in Figure 27.2, connects a strand u to its successor u′ within the same procedure instance. When a strand u spawns a strand v, the dag contains a spawn edge (u, v), which points downward in the figure. Call edges, representing normal procedure calls, also point downward. Strand u spawning strand v differs from u calling v in that a spawn induces a horizontal continuation edge from u to the strand u′ fol-


lowing u in its procedure, indicating that u′ is free to execute at the same time as v, whereas a call induces no such edge. When a strand u returns to its calling procedure and x is the strand immediately following the next sync in the calling procedure, the computation dag contains the return edge (u, x), which points upward. A computation starts with a single initial strand (the black vertex in the procedure labeled P-FIB(4) in Figure 27.2) and ends with a single final strand (the white vertex in the procedure labeled P-FIB(4)). We shall study the execution of multithreaded algorithms on an ideal parallel computer, which consists of a set of processors and a sequentially consistent shared memory. Sequential consistency means that the shared memory, which may in reality be performing many loads and stores from the processors at the same time, produces the same results as if, at each step, exactly one instruction from one of the processors is executed. That is, the memory behaves as if the instructions were executed sequentially according to some global linear order that preserves the individual orders in which each processor issues its own instructions. For dynamic multithreaded computations, which are scheduled onto processors automatically by the concurrency platform, the shared memory behaves as if the multithreaded computation's instructions were interleaved to produce a linear order that preserves the partial order of the computation dag. Depending on scheduling, the ordering could differ from one run of the program to another, but the behavior of any execution can be understood by assuming that the instructions are executed in some linear order consistent with the computation dag. In addition to making assumptions about semantics, the ideal-parallel-computer model makes some performance assumptions. Specifically, it assumes that each processor in the machine has equal computing power, and it ignores the cost of scheduling. Although this last assumption may sound optimistic, it turns out that for algorithms with sufficient "parallelism" (a term we shall define precisely in a moment), the overhead of scheduling is generally minimal in practice.

Performance measures

We can gauge the theoretical efficiency of a multithreaded algorithm by using two metrics: "work" and "span." The work of a multithreaded computation is the total time to execute the entire computation on one processor. In other words, the work is the sum of the times taken by each of the strands. For a computation dag in which each strand takes unit time, the work is just the number of vertices in the dag. The span is the longest time to execute the strands along any path in the dag. Again, for a dag in which each strand takes unit time, the span equals the number of vertices on a longest or critical path in the dag. (Recall from Section 24.2 that we can find a critical path in a dag G = (V, E) in Θ(V + E) time.) For example, the computation dag of Figure 27.2 has 17 vertices in all and 8 vertices on its critical


path, so that if each strand takes unit time, its work is 17 time units and its span is 8 time units. The actual running time of a multithreaded computation depends not only on its work and its span, but also on how many processors are available and how the scheduler allocates strands to processors. To denote the running time of a multithreaded computation on P processors, we shall subscript by P. For example, we might denote the running time of an algorithm on P processors by T_P. The work is the running time on a single processor, or T_1. The span is the running time if we could run each strand on its own processor (in other words, if we had an unlimited number of processors), and so we denote the span by T_∞. The work and span provide lower bounds on the running time T_P of a multithreaded computation on P processors:

In one step, an ideal parallel computer with P processors can do at most P units of work, and thus in T_P time it can perform at most P·T_P work. Since the total work to do is T_1, we have P·T_P ≥ T_1. Dividing by P yields the work law:

T_P ≥ T_1/P.   (27.2)

A P-processor ideal parallel computer cannot run any faster than a machine with an unlimited number of processors. Looked at another way, a machine with an unlimited number of processors can emulate a P-processor machine by using just P of its processors. Thus, the span law follows:

T_P ≥ T_∞.   (27.3)

We define the speedup of a computation on P processors by the ratio T_1/T_P, which says how many times faster the computation is on P processors than on 1 processor. By the work law, we have T_P ≥ T_1/P, which implies that T_1/T_P ≤ P. Thus, the speedup on P processors can be at most P. When the speedup is linear in the number of processors, that is, when T_1/T_P = Θ(P), the computation exhibits linear speedup, and when T_1/T_P = P, we have perfect linear speedup. The ratio T_1/T_∞ of the work to the span gives the parallelism of the multithreaded computation. We can view the parallelism from three perspectives. As a ratio, the parallelism denotes the average amount of work that can be performed in parallel for each step along the critical path. As an upper bound, the parallelism gives the maximum possible speedup that can be achieved on any number of processors. Finally, and perhaps most important, the parallelism provides a limit on the possibility of attaining perfect linear speedup. Specifically, once the number of processors exceeds the parallelism, the computation cannot possibly achieve perfect linear speedup. To see this last point, suppose that P > T_1/T_∞, in which case


the span law implies that the speedup satisfies T_1/T_P ≤ T_1/T_∞ < P. Moreover, if the number P of processors in the ideal parallel computer greatly exceeds the parallelism (that is, if P ≫ T_1/T_∞), then T_1/T_P ≪ P, so that the speedup is much less than the number of processors. In other words, the more processors we use beyond the parallelism, the less perfect the speedup. As an example, consider the computation P-FIB(4) in Figure 27.2, and assume that each strand takes unit time. Since the work is T_1 = 17 and the span is T_∞ = 8, the parallelism is T_1/T_∞ = 17/8 = 2.125. Consequently, achieving much more than double the speedup is impossible, no matter how many processors we employ to execute the computation. For larger input sizes, however, we shall see that P-FIB(n) exhibits substantial parallelism. We define the (parallel) slackness of a multithreaded computation executed on an ideal parallel computer with P processors to be the ratio (T_1/T_∞)/P = T_1/(P·T_∞), which is the factor by which the parallelism of the computation exceeds the number of processors in the machine. Thus, if the slackness is less than 1, we cannot hope to achieve perfect linear speedup, because T_1/(P·T_∞) < 1 and the span law imply that the speedup on P processors satisfies T_1/T_P ≤ T_1/T_∞ < P. Indeed, as the slackness decreases from 1 toward 0, the speedup of the computation diverges further and further from perfect linear speedup. If the slackness is greater than 1, however, the work per processor is the limiting constraint. As we shall see, as the slackness increases from 1, a good scheduler can achieve closer and closer to perfect linear speedup.

Scheduling

Good performance depends on more than just minimizing the work and span. The strands must also be scheduled efficiently onto the processors of the parallel machine. Our multithreaded programming model provides no way to specify which strands to execute on which processors. Instead, we rely on the concurrency platform's scheduler to map the dynamically unfolding computation to individual processors. In practice, the scheduler maps the strands to static threads, and the operating system schedules the threads on the processors themselves, but this extra level of indirection is unnecessary for our understanding of scheduling. We can just imagine that the concurrency platform's scheduler maps strands to processors directly. A multithreaded scheduler must schedule the computation with no advance knowledge of when strands will be spawned or when they will complete; it must operate on-line. Moreover, a good scheduler operates in a distributed fashion, where the threads implementing the scheduler cooperate to load-balance the computation. Provably good on-line, distributed schedulers exist, but analyzing them is complicated.


Instead, to keep our analysis simple, we shall investigate an on-line centralized scheduler, which knows the global state of the computation at any given time. In particular, we shall analyze greedy schedulers, which assign as many strands to processors as possible in each time step. If at least P strands are ready to execute during a time step, we say that the step is a complete step, and a greedy scheduler assigns any P of the ready strands to processors. Otherwise, fewer than P strands are ready to execute, in which case we say that the step is an incomplete step, and the scheduler assigns each ready strand to its own processor. From the work law, the best running time we can hope for on P processors is T_P = T_1/P, and from the span law the best we can hope for is T_P = T_∞. The following theorem shows that greedy scheduling is provably good in that it achieves the sum of these two lower bounds as an upper bound.

Theorem 27.1
On an ideal parallel computer with P processors, a greedy scheduler executes a multithreaded computation with work T_1 and span T_∞ in time

T_P ≤ T_1/P + T_∞ .    (27.4)

Proof We start by considering the complete steps. In each complete step, the P processors together perform a total of P work. Suppose for the purpose of contradiction that the number of complete steps is strictly greater than ⌊T_1/P⌋. Then, the total work of the complete steps is at least

P · (⌊T_1/P⌋ + 1) = P⌊T_1/P⌋ + P
                  = T_1 − (T_1 mod P) + P    (by equation (3.8))
                  > T_1                      (by inequality (3.9)) .

Thus, we obtain the contradiction that the P processors would perform more work than the computation requires, which allows us to conclude that the number of complete steps is at most ⌊T_1/P⌋.

Now, consider an incomplete step. Let G be the dag representing the entire computation, and without loss of generality, assume that each strand takes unit time. (We can replace each longer strand by a chain of unit-time strands.) Let G′ be the subgraph of G that has yet to be executed at the start of the incomplete step, and let G″ be the subgraph remaining to be executed after the incomplete step. A longest path in a dag must necessarily start at a vertex with in-degree 0. Since an incomplete step of a greedy scheduler executes all strands with in-degree 0 in G′, the length of a longest path in G″ must be 1 less than the length of a longest path in G′. In other words, an incomplete step decreases the span of the unexecuted dag by 1. Hence, the number of incomplete steps is at most T_∞. Since each step is either complete or incomplete, the theorem follows.
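To make the greedy-scheduler argument concrete, here is a small Python sketch that simulates greedy scheduling on a dag of unit-time strands and checks that the measured T_P never exceeds the bound T_1/P + T_∞ of inequality (27.4). The example dag and all function names are illustrative choices of ours, not anything defined in the text.

from collections import defaultdict

def greedy_schedule(edges, num_strands, P):
    """Simulate a greedy scheduler on a dag of unit-time strands.
    edges: list of (u, v) pairs meaning strand u must finish before v starts.
    Returns the number of time steps used (T_P)."""
    indegree = [0] * num_strands
    succs = defaultdict(list)
    for u, v in edges:
        indegree[v] += 1
        succs[u].append(v)
    ready = [u for u in range(num_strands) if indegree[u] == 0]
    done = 0
    steps = 0
    while done < num_strands:
        # A greedy scheduler runs min(P, len(ready)) strands this step.
        running, ready = ready[:P], ready[P:]
        steps += 1
        done += len(running)
        for u in running:
            for v in succs[u]:
                indegree[v] -= 1
                if indegree[v] == 0:
                    ready.append(v)
    return steps

def work_and_span(edges, num_strands):
    """T_1 is the number of strands; T_inf is the longest path, in strands."""
    succs = defaultdict(list)
    indegree = [0] * num_strands
    for u, v in edges:
        succs[u].append(v)
        indegree[v] += 1
    depth = [1] * num_strands
    order = [u for u in range(num_strands) if indegree[u] == 0]
    for u in order:                      # order grows as sources are peeled off
        for v in succs[u]:
            depth[v] = max(depth[v], depth[u] + 1)
            indegree[v] -= 1
            if indegree[v] == 0:
                order.append(v)
    return num_strands, max(depth)

# A small example dag (a diamond followed by a chain), purely illustrative.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
T1, Tinf = work_and_span(edges, 6)
for P in (1, 2, 4):
    TP = greedy_schedule(edges, 6, P)
    assert TP <= T1 / P + Tinf           # the greedy bound (27.4)
    print(f"P={P}: T_P={TP}, bound={T1 / P + Tinf:.1f}")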


The following corollary to Theorem 27.1 shows that a greedy scheduler always performs well.

Corollary 27.2
The running time T_P of any multithreaded computation scheduled by a greedy scheduler on an ideal parallel computer with P processors is within a factor of 2 of optimal.

Proof Let T_P* be the running time produced by an optimal scheduler on a machine with P processors, and let T_1 and T_∞ be the work and span of the computation, respectively. Since the work and span laws—inequalities (27.2) and (27.3)—give us T_P* ≥ max(T_1/P, T_∞), Theorem 27.1 implies that

T_P ≤ T_1/P + T_∞
    ≤ 2 · max(T_1/P, T_∞)
    ≤ 2·T_P* .

The next corollary shows that, in fact, a greedy scheduler achieves near-perfect linear speedup on any multithreaded computation as the slackness grows.

Corollary 27.3
Let T_P be the running time of a multithreaded computation produced by a greedy scheduler on an ideal parallel computer with P processors, and let T_1 and T_∞ be the work and span of the computation, respectively. Then, if P ≪ T_1/T_∞, we have T_P ≈ T_1/P, or equivalently, a speedup of approximately P.

Proof If we suppose that P ≪ T_1/T_∞, then we also have T_∞ ≪ T_1/P, and hence Theorem 27.1 gives us T_P ≤ T_1/P + T_∞ ≈ T_1/P. Since the work law (27.2) dictates that T_P ≥ T_1/P, we conclude that T_P ≈ T_1/P, or equivalently, that the speedup is T_1/T_P ≈ P.

The ≪ symbol denotes "much less," but how much is "much less"? As a rule of thumb, a slackness of at least 10—that is, 10 times more parallelism than processors—generally suffices to achieve good speedup. Then, the span term in the greedy bound, inequality (27.4), is less than 10% of the work-per-processor term, which is good enough for most engineering situations. For example, if a computation runs on only 10 or 100 processors, it doesn't make sense to value parallelism of, say, 1,000,000 over parallelism of 10,000, even with the factor of 100 difference. As Problem 27-2 shows, sometimes by reducing extreme parallelism, we can obtain algorithms that are better with respect to other concerns and which still scale up well on reasonable numbers of processors.
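As a quick numeric illustration of this rule of thumb, the short Python sketch below evaluates the greedy bound T_1/P + T_∞ for a few hypothetical computations that share the same work but have different spans; all of the numbers are made up purely for illustration.

# Evaluate the greedy bound T_1/P + T_inf for several hypothetical
# computations with the same work but different spans, and report the size
# of the span term and the predicted speedup.  All numbers are illustrative.
P = 100
T1 = 10**6
for Tinf in (10**4, 10**3, 10**2):
    slackness = (T1 / Tinf) / P
    bound = T1 / P + Tinf
    span_share = Tinf / (T1 / P)
    print(f"slackness={slackness:6.1f}  span term = {100 * span_share:5.1f}% of T1/P"
          f"  predicted speedup = {T1 / bound:6.1f} (of at most {P})")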


[Figure 27.3 The work and span of composed subcomputations. (a) When two subcomputations A and B are joined in series, the work of the composition is the sum of their work, T_1(A ∪ B) = T_1(A) + T_1(B), and the span of the composition is the sum of their spans, T_∞(A ∪ B) = T_∞(A) + T_∞(B). (b) When two subcomputations are joined in parallel, the work of the composition remains the sum of their work, but the span of the composition is only the maximum of their spans, T_∞(A ∪ B) = max(T_∞(A), T_∞(B)).]

Analyzing multithreaded algorithms

We now have all the tools we need to analyze multithreaded algorithms and provide good bounds on their running times on various numbers of processors. Analyzing the work is relatively straightforward, since it amounts to nothing more than analyzing the running time of an ordinary serial algorithm—namely, the serialization of the multithreaded algorithm—which you should already be familiar with, since that is what most of this textbook is about! Analyzing the span is more interesting, but generally no harder once you get the hang of it. We shall investigate the basic ideas using the P-FIB program.

Analyzing the work T_1(n) of P-FIB(n) poses no hurdles, because we've already done it. The original FIB procedure is essentially the serialization of P-FIB, and hence T_1(n) = T(n) = Θ(φ^n) from equation (27.1).

Figure 27.3 illustrates how to analyze the span. If two subcomputations are joined in series, their spans add to form the span of their composition, whereas if they are joined in parallel, the span of their composition is the maximum of the spans of the two subcomputations. For P-FIB(n), the spawned call to P-FIB(n − 1) in line 3 runs in parallel with the call to P-FIB(n − 2) in line 4. Hence, we can express the span of P-FIB(n) as the recurrence

T_∞(n) = max(T_∞(n − 1), T_∞(n − 2)) + Θ(1) = T_∞(n − 1) + Θ(1) ,

which has solution T_∞(n) = Θ(n). The parallelism of P-FIB(n) is T_1(n)/T_∞(n) = Θ(φ^n / n), which grows dramatically as n gets large. Thus, on even the largest parallel computers, a modest value for n suffices to achieve near-perfect linear speedup for P-FIB(n), because this procedure exhibits considerable parallel slackness.
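To see where the figures T_1 = 17 and T_∞ = 8 quoted earlier for P-FIB(4) come from, the following Python sketch counts strands under one plausible reading of Figure 27.2: each recursive instance with n ≥ 2 contributes three unit-time strands (before the spawn, around the call in line 4, and after the sync), and each base case contributes one. This strand decomposition is an assumption made for illustration, but it reproduces the stated work and span.

from functools import lru_cache

# Model of P-FIB's computation dag, assuming three unit-time strands per
# recursive instance (n >= 2) and one strand per base case (n < 2).

@lru_cache(maxsize=None)
def work(n):
    if n < 2:
        return 1
    return 3 + work(n - 1) + work(n - 2)

@lru_cache(maxsize=None)
def span(n):
    if n < 2:
        return 1
    # First strand, then the longer of: the spawned P-FIB(n-1), or the strand
    # containing the call to P-FIB(n-2) followed by that subproblem; finally
    # the strand after the sync.
    return 1 + max(span(n - 1), 1 + span(n - 2)) + 1

for n in (4, 8, 16, 32):
    t1, tinf = work(n), span(n)
    print(f"n={n:2d}  work={t1:9d}  span={tinf:3d}  parallelism={t1 / tinf:10.2f}")
# For n = 4 this prints work=17 and span=8, i.e. parallelism 17/8 = 2.125.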


Parallel loops

Many algorithms contain loops all of whose iterations can operate in parallel. As we shall see, we can parallelize such loops using the spawn and sync keywords, but it is much more convenient to specify directly that the iterations of such loops can run concurrently. Our pseudocode provides this functionality via the parallel concurrency keyword, which precedes the for keyword in a for loop statement.

As an example, consider the problem of multiplying an n × n matrix A = (a_ij) by an n-vector x = (x_j). The resulting n-vector y = (y_i) is given by the equation

y_i = Σ_{j=1}^{n} a_ij x_j ,

for i = 1, 2, ..., n. We can perform matrix-vector multiplication by computing all the entries of y in parallel as follows:

MAT-VEC(A, x)
1  n = A.rows
2  let y be a new vector of length n
3  parallel for i = 1 to n
4      y_i = 0
5  parallel for i = 1 to n
6      for j = 1 to n
7          y_i = y_i + a_ij x_j
8  return y

In this code, the parallel for keywords in lines 3 and 5 indicate that the iterations of the respective loops may be run concurrently. A compiler can implement each parallel for loop as a divide-and-conquer subroutine using nested parallelism. For example, the parallel for loop in lines 5–7 can be implemented with the call MAT-VEC-MAIN-LOOP(A, x, y, n, 1, n), where the compiler produces the auxiliary subroutine MAT-VEC-MAIN-LOOP as follows:


[Figure 27.4 A dag representing the computation of MAT-VEC-MAIN-LOOP(A, x, y, 8, 1, 8). The two numbers within each rounded rectangle give the values of the last two parameters (i and i′ in the procedure header) in the invocation (spawn or call) of the procedure. The black circles represent strands corresponding to either the base case or the part of the procedure up to the spawn of MAT-VEC-MAIN-LOOP in line 5; the shaded circles represent strands corresponding to the part of the procedure that calls MAT-VEC-MAIN-LOOP in line 6 up to the sync in line 7, where it suspends until the spawned subroutine in line 5 returns; and the white circles represent strands corresponding to the (negligible) part of the procedure after the sync up to the point where it returns.]

MAT-VEC-MAIN-LOOP(A, x, y, n, i, i′)
1  if i == i′
2      for j = 1 to n
3          y_i = y_i + a_ij x_j
4  else mid = ⌊(i + i′)/2⌋
5      spawn MAT-VEC-MAIN-LOOP(A, x, y, n, i, mid)
6      MAT-VEC-MAIN-LOOP(A, x, y, n, mid + 1, i′)
7      sync

This code recursively spawns the first half of the iterations of the loop to execute in parallel with the second half of the iterations and then executes a sync, thereby creating a binary tree of execution where the leaves are individual loop iterations, as shown in Figure 27.4.

To calculate the work T_1(n) of MAT-VEC on an n × n matrix, we simply compute the running time of its serialization, which we obtain by replacing the parallel for loops with ordinary for loops. Thus, we have T_1(n) = Θ(n²), because the quadratic running time of the doubly nested loops in lines 5–7 dominates.
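The recursive loop-splitting can be mirrored directly in Python, with a thread standing in for spawn and a join standing in for sync, as in the sketch below. It is only a structural illustration: CPython's global interpreter lock means the threads will not actually run the arithmetic in parallel, the base case computes the whole inner product at once rather than accumulating into a pre-zeroed y_i, and all of the names are ours.

import threading

def mat_vec_main_loop(A, x, y, lo, hi):
    """Recursively split the iteration range [lo, hi] in half, running the
    first half in a new thread ("spawn") and waiting for it ("sync")."""
    if lo == hi:
        y[lo] = sum(A[lo][j] * x[j] for j in range(len(x)))
        return
    mid = (lo + hi) // 2
    child = threading.Thread(target=mat_vec_main_loop,
                             args=(A, x, y, lo, mid))
    child.start()                              # spawn
    mat_vec_main_loop(A, x, y, mid + 1, hi)    # the parent handles the rest
    child.join()                               # sync

def mat_vec(A, x):
    n = len(A)
    y = [0] * n
    mat_vec_main_loop(A, x, y, 0, n - 1)
    return y

A = [[1, 2], [3, 4]]
x = [5, 6]
print(mat_vec(A, x))   # [17, 39]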


This analysis seems to ignore the overhead for recursive spawning in implementing the parallel loops, however. In fact, the overhead of recursive spawning does increase the work of a parallel loop compared with that of its serialization, but not asymptotically. To see why, observe that since the tree of recursive procedure instances is a full binary tree, the number of internal nodes is 1 fewer than the number of leaves (see Exercise B.5-3). Each internal node performs constant work to divide the iteration range, and each leaf corresponds to an iteration of the loop, which takes at least constant time (Θ(n) time in this case). Thus, we can amortize the overhead of recursive spawning against the work of the iterations, contributing at most a constant factor to the overall work.

As a practical matter, dynamic-multithreading concurrency platforms sometimes coarsen the leaves of the recursion by executing several iterations in a single leaf, either automatically or under programmer control, thereby reducing the overhead of recursive spawning. This reduced overhead comes at the expense of also reducing the parallelism, however, but if the computation has sufficient parallel slackness, near-perfect linear speedup need not be sacrificed.

We must also account for the overhead of recursive spawning when analyzing the span of a parallel-loop construct. Since the depth of recursive calling is logarithmic in the number of iterations, for a parallel loop with n iterations in which the ith iteration has span iter_∞(i), the span is

T_∞(n) = Θ(lg n) + max_{1≤i≤n} iter_∞(i) .

For example, for MAT-VEC on an n × n matrix, the parallel initialization loop in lines 3–4 has span Θ(lg n), because the recursive spawning dominates the constant-time work of each iteration. The span of the doubly nested loops in lines 5–7 is Θ(n), because each iteration of the outer parallel for loop contains n iterations of the inner (serial) for loop. The span of the remaining code in the procedure is constant, and thus the span is dominated by the doubly nested loops, yielding an overall span of Θ(n) for the whole procedure. Since the work is Θ(n²), the parallelism is Θ(n²)/Θ(n) = Θ(n). (Exercise 27.1-6 asks you to provide an implementation with even more parallelism.)

Race conditions

A multithreaded algorithm is deterministic if it always does the same thing on the same input, no matter how the instructions are scheduled on the multicore computer. It is nondeterministic if its behavior might vary from run to run. Often, a multithreaded algorithm that is intended to be deterministic fails to be, because it contains a "determinacy race."

Race conditions are the bane of concurrency. Famous race bugs include the Therac-25 radiation therapy machine, which killed three people and injured several others, and the North American Blackout of 2003, which left over 50 million people without power. These pernicious bugs are notoriously hard to find. You can run tests in the lab for days without a failure only to discover that your software sporadically crashes in the field.


A determinacy race occurs when two logically parallel instructions access the same memory location and at least one of the instructions performs a write. The following procedure illustrates a race condition:

RACE-EXAMPLE()
1  x = 0
2  parallel for i = 1 to 2
3      x = x + 1
4  print x

After initializing x to 0 in line 1, RACE-EXAMPLE creates two parallel strands, each of which increments x in line 3. Although it might seem that RACE-EXAMPLE should always print the value 2 (its serialization certainly does), it could instead print the value 1. Let's see how this anomaly might occur.

When a processor increments x, the operation is not indivisible, but is composed of a sequence of instructions:
1. Read x from memory into one of the processor's registers.
2. Increment the value in the register.
3. Write the value in the register back into x in memory.

Figure 27.5(a) illustrates a computation dag representing the execution of RACE-EXAMPLE, with the strands broken down to individual instructions. Recall that since an ideal parallel computer supports sequential consistency, we can view the parallel execution of a multithreaded algorithm as an interleaving of instructions that respects the dependencies in the dag. Part (b) of the figure shows the values in an execution of the computation that elicits the anomaly. The value x is stored in memory, and r1 and r2 are processor registers. In step 1, one of the processors sets x to 0. In steps 2 and 3, processor 1 reads x from memory into its register r1 and increments it, producing the value 1 in r1. At that point, processor 2 comes into the picture, executing instructions 4–6. Processor 2 reads x from memory into register r2; increments it, producing the value 1 in r2; and then stores this value into x, setting x to 1. Now, processor 1 resumes with step 7, storing the value 1 in r1 into x, which leaves the value of x unchanged. Therefore, step 8 prints the value 1, rather than 2, as the serialization would print. We can see what has happened.

[Figure 27.5 Illustration of the determinacy race in RACE-EXAMPLE. (a) A computation dag showing the dependencies among individual instructions. The processor registers are r1 and r2. Instructions unrelated to the race, such as the implementation of loop control, are omitted. (b) An execution sequence that elicits the bug, showing the values of x in memory and registers r1 and r2 for each step in the execution sequence.]

If the effect of the parallel execution were that processor 1 executed all its instructions before processor 2, the value 2 would be printed. Conversely, if the effect were that processor 2 executed all its instructions before processor 1, the value 2 would still be printed. When the instructions of the two processors execute at the same time, however, it is possible, as in this example execution, that one of the updates to x is lost.

Of course, many executions do not elicit the bug. For example, if the execution order were ⟨1, 2, 3, 4, 5, 6, 7, 8⟩ or ⟨1, 4, 5, 6, 2, 3, 7, 8⟩, we would get the correct result. That's the problem with determinacy races. Generally, most orderings produce correct results—such as any in which the instructions on the left execute before the instructions on the right, or vice versa. But some orderings generate improper results when the instructions interleave. Consequently, races can be extremely hard to test for. You can run tests for days and never see the bug, only to experience a catastrophic system crash in the field when the outcome is critical.

Although we can cope with races in a variety of ways, including using mutual-exclusion locks and other methods of synchronization, for our purposes, we shall simply ensure that strands that operate in parallel are independent: they have no determinacy races among them. Thus, in a parallel for construct, all the iterations should be independent. Between a spawn and the corresponding sync, the code of the spawned child should be independent of the code of the parent, including code executed by additional spawned or called children. Note that arguments to a spawned child are evaluated in the parent before the actual spawn occurs, and thus the evaluation of arguments to a spawned subroutine is in series with any accesses to those arguments after the spawn.
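The interleaving argument above can be replayed mechanically. The following Python sketch models the two increment strands at the level of the read/increment/write instructions of Figure 27.5 and runs them under two different interleavings; the encoding of strands as instruction lists is our own illustrative choice.

# Each strand is a list of instructions operating on a shared memory cell x;
# registers are private to a strand.
STRAND = ["read", "incr", "write"]

def run(interleaving):
    """Execute the two increment strands under a given interleaving.
    interleaving is a sequence of strand ids (0 or 1) saying whose next
    instruction runs at each step."""
    x = 0
    regs = [None, None]            # r1 and r2
    pc = [0, 0]                    # each strand's next instruction
    for who in interleaving:
        op = STRAND[pc[who]]
        if op == "read":
            regs[who] = x
        elif op == "incr":
            regs[who] += 1
        elif op == "write":
            x = regs[who]
        pc[who] += 1
    return x

# Serial-looking order: strand 0 finishes before strand 1 starts.
print(run([0, 0, 0, 1, 1, 1]))     # prints 2
# The bad interleaving from the text: both strands read x before either writes.
print(run([0, 0, 1, 1, 1, 0]))     # prints 1 -- one update is lost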


As an example of how easy it is to generate code with races, here is a faulty implementation of multithreaded matrix-vector multiplication that achieves a span of Θ(lg n) by parallelizing the inner for loop:

MAT-VEC-WRONG(A, x)
1  n = A.rows
2  let y be a new vector of length n
3  parallel for i = 1 to n
4      y_i = 0
5  parallel for i = 1 to n
6      parallel for j = 1 to n
7          y_i = y_i + a_ij x_j
8  return y

This procedure is, unfortunately, incorrect due to races on updating y_i in line 7, which executes concurrently for all n values of j. Exercise 27.1-6 asks you to give a correct implementation with Θ(lg n) span.

A multithreaded algorithm with races can sometimes be correct. As an example, two parallel threads might store the same value into a shared variable, and it wouldn't matter which stored the value first. Generally, however, we shall consider code with races to be illegal.

A chess lesson

We close this section with a true story that occurred during the development of the world-class multithreaded chess-playing program ⋆Socrates [81], although the timings below have been simplified for exposition. The program was prototyped on a 32-processor computer but was ultimately to run on a supercomputer with 512 processors. At one point, the developers incorporated an optimization into the program that reduced its running time on an important benchmark on the 32-processor machine from T_32 = 65 seconds to T′_32 = 40 seconds. Yet, the developers used the work and span performance measures to conclude that the optimized version, which was faster on 32 processors, would actually be slower than the original version on 512 processors. As a result, they abandoned the "optimization."

Here is their analysis. The original version of the program had work T_1 = 2048 seconds and span T_∞ = 1 second. If we treat inequality (27.4) as an equation, T_P = T_1/P + T_∞, and use it as an approximation to the running time on P processors, we see that indeed T_32 = 2048/32 + 1 = 65. With the optimization, the work became T′_1 = 1024 seconds and the span became T′_∞ = 8 seconds. Again using our approximation, we get T′_32 = 1024/32 + 8 = 40.

The relative speeds of the two versions switch when we calculate the running times on 512 processors, however. In particular, we have T_512 = 2048/512 + 1 = 5 seconds, and T′_512 = 1024/512 + 8 = 10 seconds.


The optimization that sped up the program on 32 processors would have made the program twice as slow on 512 processors! The optimized version's span of 8, which was not the dominant term in the running time on 32 processors, became the dominant term on 512 processors, nullifying the advantage from using more processors. The moral of the story is that work and span can provide a better means of extrapolating performance than can measured running times.
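The arithmetic behind this story is easy to replay. The short Python sketch below evaluates the approximation T_P = T_1/P + T_∞ for both versions of the program at the processor counts quoted in the text.

def approx_runtime(work, span, P):
    """Treat the greedy bound (27.4) as an equation: T_P = T_1/P + T_inf."""
    return work / P + span

original = (2048, 1)      # (work, span) in seconds, from the text
optimized = (1024, 8)

for P in (32, 512):
    t = approx_runtime(*original, P)
    t_opt = approx_runtime(*optimized, P)
    print(f"P={P:3d}: original={t:5.1f}s  optimized={t_opt:5.1f}s")
# P= 32: original= 65.0s  optimized= 40.0s
# P=512: original=  5.0s  optimized= 10.0s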

Exercises

27.1-1
Suppose that we spawn P-FIB(n − 2) in line 4 of P-FIB, rather than calling it as is done in the code. What is the impact on the asymptotic work, span, and parallelism?

27.1-2
Draw the computation dag that results from executing P-FIB(5). Assuming that each strand in the computation takes unit time, what are the work, span, and parallelism of the computation? Show how to schedule the dag on 3 processors using greedy scheduling by labeling each strand with the time step in which it is executed.

27.1-3
Prove that a greedy scheduler achieves the following time bound, which is slightly stronger than the bound proven in Theorem 27.1:

T_P ≤ (T_1 − T_∞)/P + T_∞ .    (27.5)

27.1-4
Construct a computation dag for which one execution of a greedy scheduler can take nearly twice the time of another execution of a greedy scheduler on the same number of processors. Describe how the two executions would proceed.

27.1-5
Professor Karan measures her deterministic multithreaded algorithm on 4, 10, and 64 processors of an ideal parallel computer using a greedy scheduler. She claims that the three runs yielded T_4 = 80 seconds, T_10 = 42 seconds, and T_64 = 10 seconds. Argue that the professor is either lying or incompetent. (Hint: Use the work law (27.2), the span law (27.3), and inequality (27.5) from Exercise 27.1-3.)


27.1-6
Give a multithreaded algorithm to multiply an n × n matrix by an n-vector that achieves Θ(n²/lg n) parallelism while maintaining Θ(n²) work.

27.1-7
Consider the following multithreaded pseudocode for transposing an n × n matrix A in place:

P-TRANSPOSE(A)
1  n = A.rows
2  parallel for j = 2 to n
3      parallel for i = 1 to j − 1
4          exchange a_ij with a_ji

Analyze the work, span, and parallelism of this algorithm.

27.1-8
Suppose that we replace the parallel for loop in line 3 of P-TRANSPOSE (see Exercise 27.1-7) with an ordinary for loop. Analyze the work, span, and parallelism of the resulting algorithm.

27.1-9
For how many processors do the two versions of the chess programs run equally fast, assuming that T_P = T_1/P + T_∞?

27.2 Multithreaded matrix multiplication

In this section, we examine how to multithread matrix multiplication, a problem whose serial running time we studied in Section 4.2. We'll look at multithreaded algorithms based on the standard triply nested loop, as well as divide-and-conquer algorithms.

Multithreaded matrix multiplication

The first algorithm we study is the straightforward algorithm based on parallelizing the loops in the procedure SQUARE-MATRIX-MULTIPLY on page 75:


P-SQUARE-MATRIX-MULTIPLY(A, B)
1  n = A.rows
2  let C be a new n × n matrix
3  parallel for i = 1 to n
4      parallel for j = 1 to n
5          c_ij = 0
6          for k = 1 to n
7              c_ij = c_ij + a_ik · b_kj
8  return C

To analyze this algorithm, observe that since the serialization of the algorithm is just SQUARE-MATRIX-MULTIPLY, the work is therefore simply T_1(n) = Θ(n³), the same as the running time of SQUARE-MATRIX-MULTIPLY. The span is T_∞(n) = Θ(n), because it follows a path down the tree of recursion for the parallel for loop starting in line 3, then down the tree of recursion for the parallel for loop starting in line 4, and then executes all n iterations of the ordinary for loop starting in line 6, resulting in a total span of Θ(lg n) + Θ(lg n) + Θ(n) = Θ(n). Thus, the parallelism is Θ(n³)/Θ(n) = Θ(n²). Exercise 27.2-3 asks you to parallelize the inner loop to obtain a parallelism of Θ(n³/lg n), which you cannot do straightforwardly using parallel for, because you would create races.

A divide-and-conquer multithreaded algorithm for matrix multiplication

As we learned in Section 4.2, we can multiply n × n matrices serially in time Θ(n^{lg 7}) = O(n^{2.81}) using Strassen's divide-and-conquer strategy, which motivates us to look at multithreading such an algorithm. We begin, as we did in Section 4.2, with multithreading a simpler divide-and-conquer algorithm.

Recall from page 77 that the SQUARE-MATRIX-MULTIPLY-RECURSIVE procedure, which multiplies two n × n matrices A and B to produce the n × n matrix C, relies on partitioning each of the three matrices into four n/2 × n/2 submatrices:

A = [ A11  A12 ]    B = [ B11  B12 ]    C = [ C11  C12 ]
    [ A21  A22 ] ,      [ B21  B22 ] ,      [ C21  C22 ] .

Then, we can write the matrix product as

[ C11  C12 ]   [ A11  A12 ] [ B11  B12 ]
[ C21  C22 ] = [ A21  A22 ] [ B21  B22 ]

               [ A11·B11  A11·B12 ]   [ A12·B21  A12·B22 ]
             = [ A21·B11  A21·B12 ] + [ A22·B21  A22·B22 ] .    (27.6)

Thus, to multiply two n × n matrices, we perform eight multiplications of n/2 × n/2 matrices and one addition of n × n matrices.
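The eight-subproduct decomposition of equation (27.6) is easy to check in code. The following Python sketch implements a serial version of this divide-and-conquer multiply (the points where the multithreaded version would spawn are only marked in comments) and compares it against a direct triple loop. It assumes n is a power of 2, and all of the names are ours.

import random

def add(C, T):
    return [[C[i][j] + T[i][j] for j in range(len(C))] for i in range(len(C))]

def naive_multiply(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def dc_multiply(A, B):
    """Divide-and-conquer multiply following equation (27.6); n a power of 2."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    def block(M, r, c):                 # extract an n/2 x n/2 submatrix
        return [row[c:c + h] for row in M[r:r + h]]
    A11, A12, A21, A22 = block(A, 0, 0), block(A, 0, h), block(A, h, 0), block(A, h, h)
    B11, B12, B21, B22 = block(B, 0, 0), block(B, 0, h), block(B, h, 0), block(B, h, h)
    # The eight subproducts; in the multithreaded version the first seven
    # would be spawned and a sync would precede the additions.
    C11 = add(dc_multiply(A11, B11), dc_multiply(A12, B21))
    C12 = add(dc_multiply(A11, B12), dc_multiply(A12, B22))
    C21 = add(dc_multiply(A21, B11), dc_multiply(A22, B21))
    C22 = add(dc_multiply(A21, B12), dc_multiply(A22, B22))
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bottom = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bottom

n = 4
A = [[random.randint(0, 9) for _ in range(n)] for _ in range(n)]
B = [[random.randint(0, 9) for _ in range(n)] for _ in range(n)]
assert dc_multiply(A, B) == naive_multiply(A, B)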


The following pseudocode implements this divide-and-conquer strategy using nested parallelism. Unlike the SQUARE-MATRIX-MULTIPLY-RECURSIVE procedure on which it is based, P-MATRIX-MULTIPLY-RECURSIVE takes the output matrix as a parameter to avoid allocating matrices unnecessarily.

P-MATRIX-MULTIPLY-RECURSIVE(C, A, B)
 1  n = A.rows
 2  if n == 1
 3      c11 = a11 · b11
 4  else let T be a new n × n matrix
 5      partition A, B, C, and T into n/2 × n/2 submatrices
        A11, A12, A21, A22; B11, B12, B21, B22; C11, C12, C21, C22;
        and T11, T12, T21, T22; respectively
 6      spawn P-MATRIX-MULTIPLY-RECURSIVE(C11, A11, B11)
 7      spawn P-MATRIX-MULTIPLY-RECURSIVE(C12, A11, B12)
 8      spawn P-MATRIX-MULTIPLY-RECURSIVE(C21, A21, B11)
 9      spawn P-MATRIX-MULTIPLY-RECURSIVE(C22, A21, B12)
10      spawn P-MATRIX-MULTIPLY-RECURSIVE(T11, A12, B21)
11      spawn P-MATRIX-MULTIPLY-RECURSIVE(T12, A12, B22)
12      spawn P-MATRIX-MULTIPLY-RECURSIVE(T21, A22, B21)
13      P-MATRIX-MULTIPLY-RECURSIVE(T22, A22, B22)
14      sync
15      parallel for i = 1 to n
16          parallel for j = 1 to n
17              c_ij = c_ij + t_ij

Line 3 handles the base case, where we are multiplying 1 × 1 matrices. We handle the recursive case in lines 4–17. We allocate a temporary matrix T in line 4, and line 5 partitions each of the matrices A, B, C, and T into n/2 × n/2 submatrices. (As with SQUARE-MATRIX-MULTIPLY-RECURSIVE on page 77, we gloss over the minor issue of how to use index calculations to represent submatrix sections of a matrix.) The recursive call in line 6 sets the submatrix C11 to the submatrix product A11·B11, so that C11 equals the first of the two terms that form its sum in equation (27.6). Similarly, lines 7–9 set C12, C21, and C22 to the first of the two terms that equal their sums in equation (27.6). Line 10 sets the submatrix T11 to the submatrix product A12·B21, so that T11 equals the second of the two terms that form C11's sum. Lines 11–13 set T12, T21, and T22 to the second of the two terms that form the sums of C12, C21, and C22, respectively. The first seven recursive calls are spawned, and the last one runs in the main strand. The sync statement in line 14 ensures that all the submatrix products in lines 6–13 have been computed, after which we add the products from T into C using the doubly nested parallel for loops in lines 15–17.


We first analyze the work M_1(n) of the P-MATRIX-MULTIPLY-RECURSIVE procedure, echoing the serial running-time analysis of its progenitor SQUARE-MATRIX-MULTIPLY-RECURSIVE. In the recursive case, we partition in Θ(1) time, perform eight recursive multiplications of n/2 × n/2 matrices, and finish up with the Θ(n²) work from adding two n × n matrices. Thus, the recurrence for the work M_1(n) is

M_1(n) = 8·M_1(n/2) + Θ(n²) = Θ(n³)

by case 1 of the master theorem. In other words, the work of our multithreaded algorithm is asymptotically the same as the running time of the procedure SQUARE-MATRIX-MULTIPLY in Section 4.2, with its triply nested loops.

To determine the span M_∞(n) of P-MATRIX-MULTIPLY-RECURSIVE, we first observe that the span for partitioning is Θ(1), which is dominated by the Θ(lg n) span of the doubly nested parallel for loops in lines 15–17. Because the eight parallel recursive calls all execute on matrices of the same size, the maximum span for any recursive call is just the span of any one. Hence, the recurrence for the span M_∞(n) of P-MATRIX-MULTIPLY-RECURSIVE is

M_∞(n) = M_∞(n/2) + Θ(lg n) .    (27.7)

This recurrence does not fall under any of the cases of the master theorem, but it does meet the condition of Exercise 4.6-2. By Exercise 4.6-2, therefore, the solution to recurrence (27.7) is M_∞(n) = Θ(lg² n). Now that we know the work and span of P-MATRIX-MULTIPLY-RECURSIVE, we can compute its parallelism as M_1(n)/M_∞(n) = Θ(n³/lg² n), which is very high.

Multithreading Strassen's method

To multithread Strassen's algorithm, we follow the same general outline as on page 79, only using nested parallelism:

1. Divide the input matrices A and B and output matrix C into n/2 × n/2 submatrices, as in equation (27.6). This step takes Θ(1) work and span by index calculation.

2. Create 10 matrices S1, S2, ..., S10, each of which is n/2 × n/2 and is the sum or difference of two matrices created in step 1. We can create all 10 matrices with Θ(n²) work and Θ(lg n) span by using doubly nested parallel for loops.


3. Using the submatrices created in step 1 and the 10 matrices created in step 2, recursively spawn the computation of seven n/2 × n/2 matrix products P1, P2, ..., P7.

4. Compute the desired submatrices C11, C12, C21, C22 of the result matrix C by adding and subtracting various combinations of the Pi matrices, once again using doubly nested parallel for loops. We can compute all four submatrices with Θ(n²) work and Θ(lg n) span.

To analyze this algorithm, we first observe that since the serialization is the same as the original serial algorithm, the work is just the running time of the serialization, namely, Θ(n^{lg 7}). As for P-MATRIX-MULTIPLY-RECURSIVE, we can devise a recurrence for the span. In this case, seven recursive calls execute in parallel, but since they all operate on matrices of the same size, we obtain the same recurrence (27.7) as we did for P-MATRIX-MULTIPLY-RECURSIVE, which has solution Θ(lg² n). Thus, the parallelism of multithreaded Strassen's method is Θ(n^{lg 7}/lg² n), which is high, though slightly less than the parallelism of P-MATRIX-MULTIPLY-RECURSIVE.

Exercises

27.2-1
Draw the computation dag for computing P-SQUARE-MATRIX-MULTIPLY on 2 × 2 matrices, labeling how the vertices in your diagram correspond to strands in the execution of the algorithm. Use the convention that spawn and call edges point downward, continuation edges point horizontally to the right, and return edges point upward. Assuming that each strand takes unit time, analyze the work, span, and parallelism of this computation.

27.2-2
Repeat Exercise 27.2-1 for P-MATRIX-MULTIPLY-RECURSIVE.

27.2-3
Give pseudocode for a multithreaded algorithm that multiplies two n × n matrices with work Θ(n³) but span only Θ(lg n). Analyze your algorithm.

27.2-4
Give pseudocode for an efficient multithreaded algorithm that multiplies a p × q matrix by a q × r matrix. Your algorithm should be highly parallel even if any of p, q, and r are 1. Analyze your algorithm.


27.2-5
Give pseudocode for an efficient multithreaded algorithm that transposes an n × n matrix in place by using divide-and-conquer to divide the matrix recursively into four n/2 × n/2 submatrices. Analyze your algorithm.

27.2-6
Give pseudocode for an efficient multithreaded implementation of the Floyd-Warshall algorithm (see Section 25.2), which computes shortest paths between all pairs of vertices in an edge-weighted graph. Analyze your algorithm.

27.3 Multithreaded merge sort

We first saw serial merge sort in Section 2.3.1, and in Section 2.3.2 we analyzed its running time and showed it to be Θ(n lg n). Because merge sort already uses the divide-and-conquer paradigm, it seems like a terrific candidate for multithreading using nested parallelism. We can easily modify the pseudocode so that the first recursive call is spawned:

MERGE-SORT′(A, p, r)
1  if p < r
2      q = ⌊(p + r)/2⌋
3      spawn MERGE-SORT′(A, p, q)
4      MERGE-SORT′(A, q + 1, r)
5      sync
6      MERGE(A, p, q, r)

Like its serial counterpart, MERGE-SORT′ sorts the subarray A[p..r]. After the two recursive subroutines in lines 3 and 4 have completed, which is ensured by the sync statement in line 5, MERGE-SORT′ calls the same MERGE procedure as on page 31.

Let us analyze MERGE-SORT′. To do so, we first need to analyze MERGE. Recall that its serial running time to merge n elements is Θ(n). Because MERGE is serial, both its work and its span are Θ(n). Thus, the following recurrence characterizes the work MS′_1(n) of MERGE-SORT′ on n elements:

MS′_1(n) = 2 MS′_1(n/2) + Θ(n) = Θ(n lg n) .
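A direct Python transcription of this structure is shown below, with a thread standing in for the spawn in line 3 and a join standing in for the sync in line 5. As with the earlier sketches, it is only structural: CPython threads will not speed up the comparisons, and the merge is an ordinary serial two-finger merge, standing in for the MERGE procedure the text calls.

import threading

def merge(A, p, q, r):
    """Serial two-finger merge of the sorted runs A[p..q] and A[q+1..r]."""
    left, right = A[p:q + 1], A[q + 1:r + 1]
    i = j = 0
    for k in range(p, r + 1):
        if j >= len(right) or (i < len(left) and left[i] <= right[j]):
            A[k] = left[i]
            i += 1
        else:
            A[k] = right[j]
            j += 1

def merge_sort(A, p, r):
    if p < r:
        q = (p + r) // 2
        child = threading.Thread(target=merge_sort, args=(A, p, q))
        child.start()              # spawn MERGE-SORT'(A, p, q)
        merge_sort(A, q + 1, r)    # the parent sorts the second half
        child.join()               # sync
        merge(A, p, q, r)

data = [5, 2, 4, 7, 1, 3, 2, 6]
merge_sort(data, 0, len(data) - 1)
print(data)   # [1, 2, 2, 3, 4, 5, 6, 7]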



≥ −r}. Then the subgraph induced by V(r) is connected. Similarly, for a real number r ≤ 0, the set N = {v ∈ V : y_v < 0} induces a connected subgraph of G. The eigenvectors of the adjacency matrix corresponding to its algebraically largest eigenvalues have also been used to partition graphs. It is of interest to ask whether a similar theorem holds for an eigenvector corresponding to the second largest eigenvalue of the adjacency matrix.


Let x, y denote eigenvectors corresponding to the algebraically largest and second largest eigenvalues, respectively, of the adjacency matrix of G. By the Perron-Frobenius theory, it is known that all components of x are positive. Fiedler's theorem states that if a is a nonnegative number, then the subgraph induced by

{v ∈ V : y_v + a·x_v ≥ 0}

is connected. Similarly, if a is a nonpositive number, then the subgraph induced by {v ∈ V : y_v + a·x_v ≤ 0} is also connected.

Alon [1] and Mohar [44] have studied the relationship of the second Laplacian eigenvalue λ₂ to the isoperimetric number, i(G). If U is a subset of the vertices of the graph G, and δU denotes the set of edges with one endpoint in U and the other in V \ U, then

i(G) = min |δU| / |U| ,

where the minimum is taken over all nonempty subsets U with |U| ≤ |V \ U|.

Clearly i(G) is related to the problem of computing good edge separators. Alon, Galil, and Milman [2], [3] have related the second Laplacian eigenvalue to the expansion properties of graphs. The relationship of the Laplacian spectrum to several other graph properties has been considered by several authors; two recent survey articles are by Mohar [45] and Bien [10]. Spectral methods for computing edge separators have been considered by several researchers: Donath and Hoffman [16], [17], Barnes [7], [8], Barnes and Hoffman [9], and Boppana [12]. An algorithm for coloring a graph by employing the eigenvectors of the adjacency matrix has been considered by Aspvall and Gilbert [6], and a spectral algorithm for finding a pseudoperipheral node has been described by Grimes, Pierce, and Simon [33]. A spectral algorithm for envelope reduction is considered in [53]. Algorithms that make use of flows in networks to compute separators have been designed by Bui et al. [13], and Leighton and Rao [37]. The former describes a bisection algorithm with good average-case behavior for degree-regular random graphs, and the latter describes an approximation algorithm for minimum quotient edge separators.

3. Lower bounds. We obtain lower bounds on the sizes of vertex separators in terms of the eigenvalues of the Laplacian matrix Q(G) in this section. The lower bounds hold for any vertex separator in the graph; in particular, these bounds apply to a smallest separator in the graph. We assume that the graph G is connected.

Let G = (V, E) denote a graph on |V| = n vertices, and let A be a subset of its vertices. Denote by ρ(v, A) the distance of a vertex v from A, i.e., the fewest number of edges in a shortest path from v to a vertex in A. Let S denote the set of vertices which are at a distance of less than ρ ≥ 2 from A, and not belonging to A. Hence

S = {v ∈ V \ A : ρ(v, A) < ρ} .

Define B = V \ (A ∪ S); if B ≠ ∅, then the distance between A and B is ρ(A, B) = ρ. If ρ > 2, the set S is a wide separator that separates A from B. If ρ = 2, we get the commonly used notion of separators. Wide separators were first used in sparse matrix algorithms by Gilbert and Schreiber [28]. Let E_A denote the set of edges with both endpoints in A, and E_AS denote the set of edges with one endpoint in A and the other in S. The sets E_B, E_S, and E_BS are defined similarly. In the following, it will be convenient to work with the fractional sizes a = |A|/n, b = |B|/n, and s = |S|/n. The degree of a vertex v will be denoted by d(v), and Δ will denote the maximum degree of vertices in G.


The first result is a lower bound on the size of a wide separator separating any pair of vertex disjoint sets A and B that are at a distance ρ from each other. As will be described later, it generalizes a result of Alon, Galil, and Milman [2].

THEOREM 3.1. Let A, B be disjoint subsets of vertices of G that are at a distance ρ ≥ 2 from each other. Let S denote the set of vertices not belonging to A that are at a distance less than ρ from A. Then

s² + βs − ρ²a(1 − a) ≥ 0 ,  where β = (Δ/λ₂) + ρ²a − 1 .

Proof. Let e, 0 denote the vector of all ones and the vector of all zeros, respectively. The Courant-Fischer-Poincaré minimax principle states that

λ₂ = min { xᵀQx / xᵀx : x ≠ 0, eᵀx = 0 }
   = min { Σ_{(i,j)∈E} (x_i − x_j)² / Σ_{i∈V} x_i² : x ≠ 0, eᵀx = 0 } .

Using the Lagrange identity in the above equation, Fiedler [24] derived the following inequality, which is valid for all real n-vectors x:

n Σ_{(i,j)∈E} (x_i − x_j)² ≥ λ₂ Σ_{i,j∈V, i<j} (x_i − x_j)² .

Solving the quadratic inequality of Theorem 3.1 for s gives s ≥ s₂, where

s₂ = (1/2) ( −β + (β² + 4ρ²a(1 − a))^{1/2} ) .

If β ≥ 2ρ(a(1 − a))^{1/2}, then expanding the right-hand side in a power series yields the result.

It remains to verify the condition of the corollary. Since (a(1 − a))^{1/2} has its maximum value 1/2 when 0 ≤ a ≤ 1, the power series expansion is valid whenever β ≥ ρ. The corollary exhibits the dependence of vertex separator sizes on λ₂: the smaller the second eigenvalue, the larger the ratio Δ/λ₂, and the smaller the lower bound on the vertex separator size. The corollary also shows the dependence of the lower bound on the distance ρ and the fractional size of the set A. The common situation of a separator corresponds to ρ = 2. In this case, the quadratic inequality becomes s² + βs − 4a(1 − a) ≥ 0, with β = (Δ/λ₂) + 4a − 1. After some simplification, it can be seen that the inequality in Theorem 2.1 of Alon, Galil, and Milman [2] is equivalent to the above inequality. In this case, when β ≥ 2, we obtain the lower bound

s ≥ 4a(1 − a) / ((Δ/λ₂) + 4a − 1) .
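To illustrate how this bound is evaluated in practice, the following Python sketch forms the Laplacian of a small graph with numpy, extracts λ₂ and the maximum degree Δ, and evaluates the ρ = 2 lower bound above for a = 1/2. The example graph, a cycle on 12 vertices, is our own choice; for it the stated condition β ≥ 2 holds.

import numpy as np

def laplacian(adj):
    """Laplacian Q = D - A for a 0/1 adjacency matrix."""
    return np.diag(adj.sum(axis=1)) - adj

# Example graph: a cycle on 12 vertices (purely illustrative).
n = 12
adj = np.zeros((n, n), dtype=float)
for v in range(n):
    adj[v, (v + 1) % n] = adj[(v + 1) % n, v] = 1.0

Q = laplacian(adj)
eigvals = np.sort(np.linalg.eigvalsh(Q))
lam2 = eigvals[1]                 # second-smallest Laplacian eigenvalue
Delta = adj.sum(axis=1).max()     # maximum degree

a = 0.5                           # fractional size of the set A
beta = Delta / lam2 + 4 * a - 1
lower_bound = 4 * a * (1 - a) / beta   # the rho = 2 bound from the text
print(f"lambda_2 = {lam2:.4f}, Delta = {Delta:.0f}")
print(f"lower bound on s = |S|/n: {lower_bound:.4f}, "
      f"i.e. at least {np.ceil(lower_bound * n):.0f} separator vertices")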

Mohar [43, Lem. 2.4] has obtained a lower bound on vertex separators in terms of λ_n and λ₂. Lower bounds on edge separators can also be obtained by this technique.

A second lower bound. We now obtain a lower bound that exhibits another factor influencing the size of vertex separators. The technique used is derived from the Wielandt-Hoffman theorem, and has been previously used by Donath and Hoffman [17] to obtain lower bounds on edge separators.

Let S be a vertex separator that separates the graph G into two sets A and B, with |A| ≥ |B| ≥ |S|. Let d(v) denote the degree of a vertex v, and let i(v) denote the "internal" degree of v, i.e., the number of edges incident on v with the other endpoint in the same set as v. Recall that the eigenvalues of Q are ordered as λ₁ = 0 < λ₂ ≤ λ₃ ≤ ... ≤ λ_n. Let the n × n matrix J = diag(J_a, J_b, J_s), where J_a is the na × na matrix of all ones, and J_b, J_s are similarly defined. The eigenvalues of J are μ₁ = na ≥ μ₂ = nb ≥ μ₃ = ns > μ₄ = ... = μ_n = 0.


THEOREM 3.3. Let S be a vertex separator that divides a graph G into two parts A, B, with |A| ≥ |B| ≥ |S|. Then

s ≥ (1 − a)λ₂ / (2Δ − (λ₃ − λ₂)) .

Proof. From the proof of the Wielandt-Hoffman theorem [34] (see also [17]),

trace(QJ) ≥ Σ_{i=1}^{n} λ_i μ_i .    (4)

We now compute both sides of the above inequality. The right-hand side is

Σ_{i=1}^{n} λ_i μ_i = na·0 + nb·λ₂ + ns·λ₃ = n(1 − a − s)λ₂ + ns·λ₃ .

To evaluate the left-hand side, we partition the symmetric matrix Q to conform to J:

Q = [ Q_aa    0      Q_as ]
    [ 0       Q_bb   Q_bs ]
    [ Q_asᵀ   Q_bsᵀ  Q_ss ] .

Then

trace(QJ) = trace(Q_aa J_a) + trace(Q_bb J_b) + trace(Q_ss J_s)
          = 2(|E| − |E_A| − |E_B| − |E_S|)
          = 2(|E_AS| + |E_BS|) ≤ 2Δ·ns .    (5)

Combining (4) and (5) gives 2Δ·ns ≥ n(1 − a − s)λ₂ + ns·λ₃, which yields the final result after some rearrangement. □

This last lower bound on a vertex separator size shows as before that the magnitude of λ₂ influences the lower bound; it also shows that the "gap" between λ₃ and λ₂ has an effect.

A word of caution is in order about these lower bounds. These bounds should be considered the same way one treats an upper bound on the error in an a priori roundoff error analysis [58]. The lower bounds obtained are not likely to be tight, except for particular classes of graphs. They do illustrate, however, that a large λ₂, with an accompanying small Δ/λ₂, will result in large sizes for the best separators in a graph.

4. Partitions of grid graphs. In this section we show that the second eigenvector of the Laplacian matrix can be used to find good vertex separators in grid graphs, which are model problems in sparse matrix computations. The separators obtained are identical to the separators used by George [26] at the first step in a nested dissection ordering of grid graphs. To compute separators by this technique, we need to first compute the eigenvectors of grids. The Laplacian spectra of grid graphs can be explicitly computed in terms of the Laplacian spectra of path graphs.


Some of this material is well known in spectral graph theory [15], but such treatments consider only eigenvalues and not eigenvectors. Further, the nine-point grid needs to be modified before its spectrum can be explicitly computed. The techniques used are quite general, and can be used to compute the spectra of several other classes of graphs which can be expressed in terms of graph products of simpler graphs.

The path graph. Let P_n denote the path graph on n vertices. We assume in the following discussion that n ≥ 2 is even. We number the vertices of the path from 1 to n in the natural order from left to right. The Laplacian matrix of P_n is tridiagonal, and hence its spectrum is easily computed. Let φ_n = π/n. We denote the elements of a vector x by writing its ith component as (x_i).

LEMMA 4.1. The Laplacian spectrum of P_n is

λ_{k,n} = 4 sin²( (1/2)(k − 1)φ_n ) ,  x_{k,n} = ( cos((i − 1/2)(k − 1)φ_n) ) ,  for k = 1, ..., n, i = 1, ..., n.

As k ranges from 1 to n, the angle (1/2)(k − 1)φ_n varies from zero to π/2; hence the eigenvalues are ordered as λ_{1,n} < λ_{2,n} < ... < λ_{n,n}. Note that λ_{1,n} = 0, x_{1,n} is the vector of all ones, λ_{2,n} = 4 sin²(φ_n/2), and x_{2,n} = ( cos((i − 1/2)φ_n) ). The components of x_{2,n} plotted against the vertices of P_30 decrease monotonically from left to right. Let x_l denote the median (n/2-th largest) component of the second eigenvector, and partition the vertices of the path into two sets, one set consisting of all vertices with components less than or equal to the median component, and the other consisting of all vertices with components larger than the median component. This partitions the path into subsets of vertices of equal size, one consisting of the vertices with positive eigenvector components, and the other consisting of vertices with negative components.
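Lemma 4.1 is easy to check numerically. The Python sketch below builds the tridiagonal Laplacian of P_n with numpy, compares its eigenvalues with the closed form 4 sin²((k − 1)π/(2n)), and confirms that the components of the second eigenvector are monotone along the path, which is what makes the median split above well defined. The particular value of n is an arbitrary choice.

import numpy as np

n = 8                                   # any even n >= 2 will do
# Laplacian of the path P_n: degree 1 at the ends, 2 in the middle.
Q = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
Q[0, 0] = Q[n - 1, n - 1] = 1

eigvals, eigvecs = np.linalg.eigh(Q)    # eigh returns eigenvalues in ascending order

# Closed form from Lemma 4.1: lambda_k = 4 sin^2((k-1) pi / (2n)).
k = np.arange(1, n + 1)
closed_form = 4 * np.sin((k - 1) * np.pi / (2 * n)) ** 2
assert np.allclose(np.sort(eigvals), np.sort(closed_form))

# The second eigenvector (the Fiedler vector) is monotone along the path,
# up to an overall sign chosen by the eigensolver.
fiedler = eigvecs[:, 1]
diffs = np.diff(fiedler)
assert np.all(diffs < 0) or np.all(diffs > 0)
print("eigenvalues match the closed form; second eigenvector is monotone")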

Graph products. We can compute the spectra of grid graphs from the spectra of the path graph. We require the concepts of graph products and the Kronecker products of matrices. Our notation for graph products is from Cvetkovic, Doob, and Sachs [15], and a good discussion of Kronecker products may be found in Fiedler [25].

For i = 1, 2, let G_i = (V_i, E_i) be graphs. The Cartesian sum G₁ + G₂ is the graph (V₁ × V₂, E), where vertices (i₁, j₁) and (i₂, j₂) are joined by an edge if either i₁ = i₂ and {j₁, j₂} is an edge in G₂, or j₁ = j₂ and {i₁, i₂} is an edge in G₁. The Cartesian product G₁ · G₂ is the graph (V₁ × V₂, F), where vertices (i₁, j₁) and (i₂, j₂) are joined by an edge if {i₁, i₂} is an edge in G₁ and {j₁, j₂} is an edge in G₂. The strong sum G₁ ⊕ G₂ is the graph (V₁ × V₂, E ∪ F); thus it contains the edges in both the Cartesian sum and the Cartesian product. It is easy to verify that the Cartesian sum P_n + P_m is the five-point m × n grid graph, and that the strong sum P_n ⊕ P_m is the nine-point m × n grid graph.

Since the grid graphs can be obtained from appropriate graph products of the path graph, the Laplacian matrices of the grid graphs can be obtained from Kronecker products involving the Laplacian matrices of the path graph. If C is a p × q matrix and D is r × s, recall that the Kronecker product C ⊗ D is the pr × qs matrix with each element d_ij of D replaced by the submatrix (C d_ij).

The five-point grid. We consider the m × n five-point grid and, without loss of generality, consider the case when n is even and m ≤ n. Initially we consider the case m < n; at the end of this section, we discuss how the results are modified when n is odd, or m = n. We draw the m × n grid with n vertices in each row and m vertices in each column.


Let Q denote the Laplacian matrix of the five-point m × n grid graph, let R_n denote the Laplacian matrix of the path graph on n vertices, and let I_n be the identity matrix of order n. Recall that (λ_{k,n}, x_{k,n}) denotes the kth eigenpair (when eigenvalues are listed in increasing order) of the path graph with n vertices. The following result is well known; we include a proof for completeness, and because we wish to indicate how a similar result is obtained for the Laplacian spectrum of a modified nine-point grid.

THEOREM 4.2. The Laplacian spectrum of the m × n five-point grid is

μ_{k,l} = λ_{k,n} + λ_{l,m} ,  y_{k,l} = x_{k,n} ⊗ x_{l,m} ,  for k = 1, ..., n, l = 1, ..., m.

Proof. It is easy to verify that the Laplacian matrix of the five-point grid can be expressed in terms of the Laplacian matrix of the path graph as Q = R_n ⊗ I_m + I_n ⊗ R_m. The first term in the sum creates m copies of the path on n vertices, and the second term adds the "vertical" edges, which join neighboring vertices in each column of the grid. We show that (μ_{k,l}, y_{k,l}) is an eigenpair of Q:

Q (x_{k,n} ⊗ x_{l,m}) = (R_n ⊗ I_m)(x_{k,n} ⊗ x_{l,m}) + (I_n ⊗ R_m)(x_{k,n} ⊗ x_{l,m})
                      = (R_n x_{k,n}) ⊗ (I_m x_{l,m}) + (I_n x_{k,n}) ⊗ (R_m x_{l,m})
                      = λ_{k,n} x_{k,n} ⊗ x_{l,m} + x_{k,n} ⊗ λ_{l,m} x_{l,m}
                      = (λ_{k,n} + λ_{l,m}) x_{k,n} ⊗ x_{l,m} .

The transformation from the first line to the second line uses the associativity of the Kronecker product. □

The smallest eigenvalue μ_{1,1} = λ_{1,n} + λ_{1,m} is zero. The next smallest eigenvalue is μ_{2,1} = 4 sin²(φ_n/2), and the corresponding eigenvector is

y_{2,1} = x_{2,n} ⊗ x_{1,m} = ( cos((i − 1/2)φ_n) ) ⊗ 1_m ,

where 1_m is the all-ones vector of length m.

The components of y_{2,1} are constant along each column of m vertices, and the components decrease from left to right across a row. Columns numbered 1 to n/2 have positive components, and the rest of the columns have negative components. The components of this eigenvector of the m × n five-point grid are plotted in Fig. 1. These results show that the second eigenvector of the grid can be used to compute good edge separators and vertex separators. Let y denote this eigenvector in the following discussion, and let y_l denote the median component (the (mn/2)th largest component out of mn). Let y_v denote the eigenvector component corresponding to vertex v.

COROLLARY 4.3. Let V denote the set of vertices of the five-point m × n grid (m < n, n even), and let V be partitioned by its second eigenvector as follows: A′ = {v ∈ V : y_v ≤ y_l}, B′ = V \ A′.
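The structure described in Theorem 4.2 and exploited in Corollary 4.3 can be verified directly with numpy: the sketch below assembles the five-point grid Laplacian as the Kronecker sum R_n ⊗ I_m + I_n ⊗ R_m, checks its second smallest eigenvalue against 4 sin²(π/(2n)), and splits the vertices at the median component of the corresponding eigenvector. The grid dimensions are an arbitrary choice.

import numpy as np

def path_laplacian(n):
    R = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    R[0, 0] = R[n - 1, n - 1] = 1
    return R

m, n = 4, 8                      # m < n, n even, as in the text
Rn, Rm = path_laplacian(n), path_laplacian(m)
# Laplacian of the five-point m x n grid as a Kronecker sum.
Q = np.kron(Rn, np.eye(m)) + np.kron(np.eye(n), Rm)

eigvals, eigvecs = np.linalg.eigh(Q)
assert np.isclose(eigvals[0], 0.0)
assert np.isclose(eigvals[1], 4 * np.sin(np.pi / (2 * n)) ** 2)

# Partition the mn vertices at the median component of the second eigenvector:
# this puts half of the columns on one side and half on the other (which half
# depends on the sign the eigensolver picks for the eigenvector).
y = eigvecs[:, 1]
median = np.sort(y)[m * n // 2 - 1]
A_prime = np.flatnonzero(y <= median)
print(f"|A'| = {A_prime.size} of {m * n} vertices")   # half of the grid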

… 4 GB), which causes the failures in SSSP (Figure 2b). Async's poor performance is due to lock contention, the lack of message batching, and communication overheads in distributed locking. Specifically, some machines, having completed execution on their portion of the graph, will initiate a termination detection algorithm only to find other machines that are not done. After a while, they reinitiate the algorithm, only to cancel again. This repeated cycling between starting and cancelling termination is expensive and can result in substantial overheads. Async's network I/O is also highly variable due to its non-deterministic execution.

Given GraphLab's excellent synchronous performance, there is little reason to run PageRank or SSSP asynchronously for the datasets we have tested: while async can provide faster convergence on a single machine [28], the overheads, poor scalability, and difficulty in tuning distributed locking prevent a net performance gain. Despite its overheads, async may perform better in specific cases where BSP is inefficient, such as SSSP on graphs with very large diameters. Finally, async's poor network scalability indicates that it can achieve better performance on a small number of powerful machines rather than on a large cluster of weaker machines.

6.6 Results for LJ and OR Datasets

The smaller LJ and OR graphs share similar trends with the larger graphs. For GPS, its poor setup time performance is more evident: 128 machines take 10× longer than 16 machines on LJ. Furthermore, GPS's optional optimizations still provide little benefit on LJ and OR, while for DMST, GPS's superstep overheads result in computation times that are up to twice as long as Giraph hash map. Similarly, Mizan's scalability issues remain evident on SSSP and WCC. Lastly, GraphLab's asynchronous mode fails on 128 machines for both LJ and OR, on both PageRank and SSSP, due to a substantial increase in memory usage. However, this is easy to avoid, as LJ and OR run fine on fewer machines.

Table 5: Our experiences with each system, rated on development support (Dev. Support) and usability for Giraph, GPS, Mizan, and GraphLab.

7. EXPERIENCES

In this section we highlight our experiences with each system, particularly from an end-user’s point of view. We consider two aspects: development support (dev. support), which is how user-friendly a system is to algorithm developers in terms of API, documentation, and so forth; and usability, how easy it is to use and monitor the system and its progress, and how effectively errors or crashes are displayed. A high-level ranking is provided in Table 5.

7.1 Giraph

Given its large developer base and rigorous coding style requirements, Giraph is the easiest system to understand and code for. In contrast, neither GPS nor Mizan is nearly as well documented, requiring us to contact the authors for help in understanding certain system components. Giraph's well-documented API allowed us to identify potential pitfalls without needing to understand the implementation details. Depending on the use case, Giraph's use of Java and its tight integration with Hadoop, HIVE, and so on can be another advantage.

Structurally, Giraph is designed in a modular manner, allowing custom input, output, and data type formats. These data types must be specified as command-line arguments, which can initially be daunting. However, this setup provides unparalleled flexibility, especially compared to the limited input and output options of GPS and Mizan. Changing the input format is also easy in GraphLab but, from our limited experience, the implementation is less modular.

Finally, Giraph has excellent usability due to its use of the Hadoop web monitoring interface. When an experiment runs into trouble, it is evident in both the console output and on the web interface. Additionally, Giraph outputs log files, which are useful for debugging and monitoring progress.

7.2 GPS

GPS has a similar API to Giraph, making it easy to understand and to port code over. While we did not develop as extensively in GPS or have the opportunity to work with its global memory map (GPS's version of aggregators), we found its API documentation to be lacking compared to Giraph and GraphLab. Finally, unlike Giraph, custom data types in GPS are hard-coded as a job configuration class for each algorithm. While not as flexible as Giraph's command-line approach, it provides the same modularity.

For usability, GPS provides an excellent, albeit poorly documented, web interface that tracks many detailed statistics. However, GPS does have a few issues that make experimentation difficult. First, communication teardown is not robust: GPS workers and master do not free their ports immediately, forcing a minimum wait time of a minute between runs. Without this, connections will fail to establish on the next run, as ports are hard-coded in a configuration file. Second, failures or errors are not prominently displayed on the web interface and do not appear at all in the console. In particular, failed computations never timeout.

7.3 Mizan

Mizan* is still very much an experimental system compared to Giraph and GraphLab. It has many core features implemented but still has a few bugs and is missing several useful features. For example, it currently assumes that vertex values and edge weights are of the same type. This makes DMST consume more memory, as every edge must store 5 fields rather than 3. The inability to read in edge weights from an input graph is also a barrier to testing more complex inputs, such as the graphs with randomly generated unique edge weights used for DMST. Finally, Mizan does not yet support master.compute(), which is also useful for DMST.

As Mizan is written in C++, there is less run-time checking than in Java. In combination with Mizan's API design, algorithm implementations are more susceptible to memory errors than in Giraph, GPS, and GraphLab. Sparse API documentation means that identifying major pitfalls requires understanding the internal code. For example, deleting or changing a vertex value immediately frees the previous underlying object, despite the original allocation occurring in the user's algorithm code. In contrast, both Giraph and GraphLab clearly document unsafe operations. Having such warnings, or a more robust API, would make development much easier. Lastly, differing coding styles in Mizan mean strange bugs can arise due to subtle inconsistencies. These bugs make it difficult to debug Mizan's system on our own, and are exacerbated by the fact that Mizan does not perform logging.

For usability, Mizan provides only console output, but it contains sufficient information about where the computation is. However, the console output suffers from serialization issues, so text gets clobbered when using many workers. Finally, unlike GPS, failures are immediate and do not hang.

7.4 GraphLab

While we did not develop algorithms in GraphLab, we found that its source code and API are well-documented and easy to understand. GraphLab makes good use of C++ features and, unlike Mizan, there is less confusion about memory allocation and deallocation. Like Giraph, GraphLab also has an active developer and user community.

Similar to Mizan, GraphLab outputs only to the console and failures do not hang. While it does not log to files like Giraph and GPS, it can output different levels of information for debugging. Unlike Giraph and GPS, GraphLab provides parallel input loading only when the input graph file is manually split into disjoint parts. Supporting parallel loading for a single input file would help improve usability. Finally, like GPS and Mizan, GraphLab is not as tightly integrated with Hadoop as Giraph.

* We thank Mizan's authors for providing several bug fixes.

8. CONCLUSION

Graph processing systems are increasingly important as more and more problems require dealing with graphs. To this end, we presented a thorough comparison of four recent graph processing systems, Giraph, GPS, Mizan, and GraphLab, on five datasets and four different algorithms: PageRank, SSSP, WCC, and DMST. We used 16, 32, 64, and 128 Amazon EC2 instances with computation time, setup time, memory usage, and network I/O as our metrics. We found that Giraph and GraphLab’s synchronous mode have good all-around performance, while GPS excels at memory efficiency. We also found that Giraph, GPS, and GraphLab’s synchronous mode outperform Mizan in all experiments. We identified Giraph hash map as a better choice than Giraph byte array for graph mutations. We found that GPS’s LALP and dynamic migration optimizations provide little performance benefit, and that GraphLab’s asynchronous mode has poor scalability and performance due to communication overheads. Finally, for each system, we identified potential areas of improvement: for Giraph, better workload balancing to reduce maximum memory usage, and a need for adjacency list data structures that are both mutation and memory efficient; for GPS, exploiting data locality to improve the scalability of setup times, and avoiding message polling to minimize superstep overheads; for Mizan, adding system and message processing optimizations to improve performance and scalability; and, for GraphLab, reducing communication overheads for its asynchronous mode.

