Pattern Name: SeparableDependencies
AlgorithmStructure Design Space
This pattern is used for task-based decompositions in which the dependencies between tasks can be eliminated as follows: Necessary global data is replicated and (partial) results are stored in local data structures. Global results are then obtained by reducing (combining) results from the individual tasks.
In general, task-based algorithms present two distinct challenges to the software designer: allocating the tasks among the processors so the computational load is evenly distributed; and managing the dependencies between tasks so that if multiple tasks update the same data structure, these updates do not interfere with each other.
This pattern represents an important class of problems in which these two issues can be separated. In these problems, dependencies can be pulled outside the set of concurrent tasks, allowing the tasks to proceed independently.
Consider an example, the classic N-body problem: A system contains N bodies that move in space, each exerting distance-dependent forces on each of the other N-1 bodies. The problem is to calculate the motion of the bodies. At each instant in time, each body has a position, a velocity, and a force vector. For each time instant, a sequential solution calculates the force vector incident on each body by summing the force contributions from each of the other N-1 bodies and then computes the new position and velocity of each body using its force vector.
One way to parallelize the problem is to decompose the computation performed at each time instant into tasks such that each task is responsible for computing the position and velocity of a subset of the bodies along with the contribution of the forces from that subset on the rest of the system. These tasks are not independent, since each task needs to read the locations of the other bodies, and each task must update the force vectors of all the other bodies.
This problem has two features that can be exploited. First, during the calculation for a particular time instant, the location of each body is first read by the other tasks and then modified by only a single task; it is not read by any other task after it has been written. Therefore, dependencies between tasks involving the location data can be eliminated by replicating this data in all the tasks. Second, since the force vectors are the sums of forces due to each body, they can be computed in two stages as follows: Each task can compute a partial sum, placing the result in a task-local variable. Once all the tasks have computed their partial sums, these partial sums can then be summed (reduced) to give the desired local force vectors. As a result, all the dependencies between the tasks during the concurrent execution have been "pulled out" of the concurrent part of the computation.
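The sketch below illustrates this idea for a simplified one-dimensional version of the force calculation. It assumes a shared-memory environment with OpenMP; the force law and the names (pair_force, local_force, and so on) are purely illustrative. Each UE accumulates force contributions in its own local copy of the force array, and the local copies are then merged into the global array.

#include <stdlib.h>
#include <string.h>

#define N 1024

double pos[N], vel[N], force[N];   /* state of the N bodies (1D for brevity) */

/* Illustrative pairwise force contribution -- a placeholder, not a real force law. */
static double pair_force(double xi, double xj)
{
    return xj - xi;
}

void compute_forces(void)
{
    memset(force, 0, sizeof force);

    #pragma omp parallel
    {
        /* Local copy of the force array: this UE's partial sums. */
        double *local_force = calloc(N, sizeof(double));
        int i, j;

        /* Each UE handles a subset of "source" bodies i and accumulates
           their contributions to every other body j in its local copy. */
        #pragma omp for
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                if (i != j)
                    local_force[j] += pair_force(pos[i], pos[j]);

        /* Reduction step: merge the local copies into the global array. */
        #pragma omp critical
        for (j = 0; j < N; j++)
            force[j] += local_force[j];

        free(local_force);
    }
}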
The techniques described in the EmbarrassinglyParallel pattern can be applied to the now-independent tasks. Some dependencies between tasks can be eliminated by replacing global data structures with copies local to each UE. (In a shared-memory environment, it is possible for all tasks that do not modify the global data structure to share a single copy.) Others may be eliminated by writing results to a local object and then, in a logically separate step, reducing (merging) the local objects into a single object. In essence, the dependencies between tasks are eliminated from the concurrent part of the computation, thus making the construction of a parallel algorithm much simpler. The following figure illustrates the central idea of this pattern.
This pattern can be used when the final value of each shared object can be expressed as a combination of contributions computed by the individual tasks:
result = combine(v0, v1, ..., vM-1)
where each vi is computed independently by task i, possibly after replicating global data. The object can thus be treated as a "reduction variable" (using the term "reduction" somewhat more broadly than usual), with each task computing a local partial result and these partial results being subsequently combined into a final global result.
The pattern is especially effective when:
Implementations of this pattern include the following key elements: defining the set of tasks and scheduling their execution onto UEs; defining and updating the local data structures that hold each task's partial results; and combining (reducing) the local results into a single global result.
When this pattern is used in a parallel algorithm, it usually drives the top-level organization of the parallel algorithm. It frequently appears, as in the N-body example, as the body of a compute-intensive loop within a program. Often, but not necessarily, the reduction step uses one of a small set of standard commutative and associative reduction operators (+, *, logical and, logical or, bitwise and, bitwise or, min, max).
This pattern shows up in many forms.
Before describing the implementation of this pattern, it is useful to remember that what we are essentially doing is isolating the dependencies so that the remaining concurrency becomes embarrassingly parallel.
A set of tasks is represented and scheduled for execution on multiple units of execution (UEs). Usually, the tasks correspond to iterations of a loop. In this case we implement this pattern by splitting the loop between multiple UEs. The key to making algorithms based on this pattern run well is to schedule their execution so the load is balanced between the UEs. The approaches used for this scheduling are the same as those described in the Implementation section of the EmbarrassinglyParallel pattern.
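For instance, with a simple static (block) schedule each UE can compute its own contiguous range of loop iterations from its index. The following is a minimal sketch; the names my_range, num_ues, and my_id are illustrative and not part of any particular library.

/* Split the iterations [0, n) into one contiguous block per UE.
   my_id is this UE's index in [0, num_ues). */
void my_range(int n, int num_ues, int my_id, int *begin, int *end)
{
    int chunk = n / num_ues;
    int extra = n % num_ues;   /* the first 'extra' UEs get one more iteration */

    *begin = my_id * chunk + (my_id < extra ? my_id : extra);
    *end   = *begin + chunk + (my_id < extra ? 1 : 0);
}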
As discussed previously, the tasks cooperatively update one or more objects, and this pattern is not applicable if values written to such an object by one task are subsequently read by another. Once these objects have been identified, local copies must be created and initialized. It is simplest to think of each task as having its own copy of these shared objects, with their values initialized to the objects' initial values during task initialization and updated as the task performs its work. Often, the local update will in fact be a "local reduction". For example, a task in the N-body problem may handle several bodies; within the task, the local force vector updates would in fact be a reduction of the force contributions for the subset of the bodies handled by that task.
In practice it may be possible to reduce the number of physical copies required, particularly in shared-memory environments, since tasks that do not update a particular object can share a single copy of the object. In a shared-memory environment, an object that is read by multiple tasks but updated by only one need only be duplicated, with one copy maintaining its initial value (to be used by all tasks that do not modify the object) and one copy being updated by the single task that modifies the object.
Also, in practice it is usually enough to have one local copy per unit of execution (UE) rather than one local copy per task.
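A common shared-memory arrangement, sketched below under the assumption that OpenMP is available, is an array with one slot per UE: slot k is the local copy belonging to UE k, initialized to the shared object's initial value and updated only by the tasks running on that UE. The names NUM_UES, partial, and contrib are illustrative.

#include <omp.h>

#define NUM_UES 8

double global_initial = 0.0;   /* initial value of the shared object */
double partial[NUM_UES];       /* one local copy per UE, not per task */

void update_local_copies(const double *contrib, int n)
{
    #pragma omp parallel num_threads(NUM_UES)
    {
        int me = omp_get_thread_num();
        int i;

        partial[me] = global_initial;   /* initialize this UE's local copy */

        /* Each iteration (task) updates only the copy belonging to its UE,
           so the updates amount to a "local reduction" within the UE. */
        #pragma omp for
        for (i = 0; i < n; i++)
            partial[me] += contrib[i];
    }
}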
The reduction step occurs after all the tasks have finished updating their local copies of the data structure. The first step is thus to determine that all partial results have been computed. In its most general form, this is almost the same as detecting termination in the EmbarrassinglyParallel pattern. The difference is that the tasks should have finished the update of the global data structure but may not have actually terminated, since they may still need to participate in the reduction operation.
The reduction step itself performs the calculation indicated earlier (result = combine(v0, v1, ..., vM-1)). In most cases the combine function can be expressed in the form (v0 · v1 · ... · vM-1), where · is a binary operator. Computations of this form are sufficiently common in parallel algorithms that they are discussed separately in the Reduction pattern; many programming environments also supply high-quality implementations of this pattern.
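In its simplest form, the combine step is just a loop over the per-UE copies applying the binary operator (here +), as in the sketch below; in message-passing environments a single collective call such as MPI's MPI_Reduce typically performs the same computation.

/* Combine the per-UE partial results with a binary operator (here +). */
double combine_partials(const double *partial, int num_ues)
{
    double result = partial[0];
    int k;

    for (k = 1; k < num_ues; k++)
        result = result + partial[k];
    return result;
}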
This example, taken from the MPI reference manual [Snir96], uses this pattern to compute the product of a vector and a matrix. A simple sequential program to accomplish this result is as follows:
SUBROUTINE BLAS2(M,N,a,b,c)
  REAL a(M), b(M,N)   !input vector and matrix
  REAL c(N)           !result
  INTEGER M, N, i, j
  DO j = 1, N
    c(j) = 0.0
    DO i = 1, M
      c(j) = c(j) + a(i)*b(i,j)
    END DO
  END DO
  RETURN
END
Each element of the result vector is a reduction of M partial sums, so the SeparableDependencies pattern applies -- we can calculate each element by computing and then combining partial sums, with the decomposition based on partitioning the input vector and matrix as shown in the following figure (shaded areas indicate data for one task).
To calculate the matrix-vector product, each UE computes a local sum (the product of its section of vector a and its section of matrix b); final values for the elements of product vector c are obtained by summing these local sums.
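A message-passing sketch of this decomposition follows. It assumes MPI and assumes each process already holds its own section a_local (of length m_local) of the vector and the corresponding rows b_local of the matrix, stored row-major; the variable names are illustrative. Each process computes a full-length partial result vector, and MPI_Reduce sums the partial vectors into the final result on one process.

#include <stdlib.h>
#include <mpi.h>

/* c (length n) receives the final result on rank 0. */
void matvec(const double *a_local, const double *b_local,
            int m_local, int n, double *c)
{
    double *c_local = malloc(n * sizeof(double));
    int i, j;

    /* Local partial sums: this process's section of a times its rows of b. */
    for (j = 0; j < n; j++) {
        c_local[j] = 0.0;
        for (i = 0; i < m_local; i++)
            c_local[j] += a_local[i] * b_local[i * n + j];
    }

    /* Reduction step: sum the per-process partial vectors element by element. */
    MPI_Reduce(c_local, c, n, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    free(c_local);
}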
This example performs numerical integration using the midpoint rule. The example is almost trivial, but it effectively addresses most of the key issues raised by this pattern. The goal here is to compute pi by integrating the function 4/(1 + x*x) over the interval from 0 to 1. But what is important here is not the mathematics but the pattern, which consists of computing a sum of individual contributions and which is parallelized by computing a number of partial sums and then combining them to get the global result.
First, here's a simple sequential version of a program that solves this problem:
#include <stdio.h>

static long num_steps = 100000;

int main(void)
{
    int i;
    double step, x, pi, sum = 0.0;

    step = 1.0 / (double) num_steps;
    for (i = 1; i <= num_steps; i++) {
        x = (i - 0.5) * step;            /* midpoint of the i-th subinterval */
        sum = sum + 4.0 / (1.0 + x*x);
    }
    pi = sum * step;
    printf(" pi is %f \n", pi);
    return 0;
}
We present two parallel implementations.
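One way to realize a shared-memory version, sketched below assuming an OpenMP compiler, is to let the reduction clause give each thread a private partial sum and combine the per-thread sums when the loop completes.

#include <stdio.h>

static long num_steps = 100000;

int main(void)
{
    long i;
    double pi, sum = 0.0;
    double step = 1.0 / (double) num_steps;

    /* Each thread accumulates a private copy of sum; the reduction clause
       combines the per-thread partial sums at the end of the loop. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 1; i <= num_steps; i++) {
        double x = (i - 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }

    pi = sum * step;
    printf(" pi is %f \n", pi);
    return 0;
}

A message-passing counterpart, sketched here assuming MPI, makes the same structure explicit: each process computes a local partial sum over its share of the iterations, and MPI_Reduce performs the combine step.

#include <stdio.h>
#include <mpi.h>

static long num_steps = 100000;

int main(int argc, char *argv[])
{
    int rank, size;
    long i;
    double step, x, local_sum = 0.0, sum, pi;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    step = 1.0 / (double) num_steps;

    /* Cyclic split of the iterations: rank r handles i = r+1, r+1+size, ... */
    for (i = rank + 1; i <= num_steps; i += size) {
        x = (i - 0.5) * step;
        local_sum += 4.0 / (1.0 + x * x);
    }

    /* Reduction step: combine the per-process partial sums on rank 0. */
    MPI_Reduce(&local_sum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        pi = sum * step;
        printf(" pi is %f \n", pi);
    }

    MPI_Finalize();
    return 0;
}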
This pattern has been used extensively in computational chemistry applications. The parallel computational chemistry book [Mattson95b] includes several chapters that discuss the use of this pattern.
Data-parallel algorithms make heavy use of the SeparableDependencies pattern. While traditionally intended for SIMD computers, data-parallel algorithms can be implemented on other architectures as well. Several data-parallel algorithms, including some that would be suitable for reduction with associative operators, sorting algorithms, and algorithms on linked lists, are described in [Hillis86].
There are also a number of very interesting sorting and other algorithms in which the "split" phase of the pattern is non-trivial. Such algorithms can be found in [JaJa92].
This pattern converts into the EmbarrassinglyParallel pattern when the dependencies are removed.
If the dependencies involve reads as well as writes, the pattern is transformed into the ProtectedDependencies pattern.