Pattern Name: SeparableDependencies
AlgorithmStructure Design Space
This pattern is used for task-based decompositions in which the dependencies between tasks can be eliminated as follows: Necessary global data is replicated and (partial) results are stored in local data structures. Global results are then obtained by reducing (combining) results from the individual tasks.
In general, task-based algorithms present two distinct challenges to the software designer: allocating the tasks among the processors so the computational load is evenly distributed; and managing the dependencies between tasks so that if multiple tasks update the same data structure, these updates do not interfere with each other.
This pattern represents an important class of problems in which these two issues can be separated. In these problems, dependencies can be pulled outside the set of concurrent tasks, allowing the tasks to proceed independently.
Consider an example, the classic N-body problem: A system contains N bodies that move in space, each exerting distance-dependent forces on each of the other N-1 bodies. The problem is to calculate the motion of the bodies. At each instant in time, each body has a position, a velocity, and a force vector. For each time instant, a sequential solution calculates the force vector incident on each body by summing the force contributions from each of the other N-1 bodies and then computes the new position and velocity of each body using its force vector.
One way to parallelize the problem is to decompose the computation performed at each time instant into tasks such that each task is responsible for computing the position and velocity of a subset of the bodies along with the contribution of the forces from that subset on the rest of the system. These tasks are not independent, since each task needs to read the locations of the other bodies, and each task must update the force vectors of all the other bodies.
This problem has two features that can be exploited. First, during the calculation for a particular time instant, the location of each body is first read by the other tasks and then modified by only a single task; it is not read by any other task after it has been written. Therefore, dependencies between tasks involving the location data can be eliminated by replicating this data in all the tasks. Second, since the force vectors are the sums of forces due to each body, they can be computed in two stages as follows: Each task can compute a partial sum, placing the result in a task-local variable. Once all the tasks have computed their partial sums, these partial sums can then be summed (reduced) to give the desired local force vectors. As a result, all the dependencies between the tasks during the concurrent execution have been "pulled out" of the concurrent part of the computation.
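The sketch below illustrates this idea for a simplified one-dimensional version of the force calculation. It assumes a shared-memory environment with OpenMP; the force law and the names (pair_force, local_force, and so on) are purely illustrative. Each UE accumulates force contributions in its own local copy of the force array, and the local copies are then merged into the global array.

#include <stdlib.h>
#include <string.h>

#define N 1024

double pos[N], vel[N], force[N];   /* state of the N bodies (1D for brevity) */

/* Illustrative pairwise force contribution -- a placeholder, not a real force law. */
static double pair_force(double xi, double xj)
{
    return xj - xi;
}

void compute_forces(void)
{
    memset(force, 0, sizeof force);

    #pragma omp parallel
    {
        /* Local copy of the force array: this UE's partial sums. */
        double *local_force = calloc(N, sizeof(double));
        int i, j;

        /* Each UE handles a subset of "source" bodies i and accumulates
           their contributions to every other body j in its local copy. */
        #pragma omp for
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                if (i != j)
                    local_force[j] += pair_force(pos[i], pos[j]);

        /* Reduction step: merge the local copies into the global array. */
        #pragma omp critical
        for (j = 0; j < N; j++)
            force[j] += local_force[j];

        free(local_force);
    }
}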
The techniques described in the EmbarrassinglyParallel pattern can be applied to the now-independent tasks. Some dependencies between tasks can be eliminated by replacing global data structures with copies local to each UE. (In a shared-memory environment, it is possible for all tasks that do not modify the global data structure to share a single copy.) Others may be eliminated by writing results to a local object and then, in a logically separate step, reducing (merging) the local objects into a single object. In essence, the dependencies between tasks are eliminated from the concurrent part of the computation, thus making the construction of a parallel algorithm much simpler. The following figure illustrates the central idea of this pattern.
This pattern can be used when the final value of each shared object can be expressed as a combination of contributions computed by the individual tasks:
result = combine(v0, v1, ..., vM-1)
where each vi is computed independently by task i, possibly after replicating global data. The object can thus be treated as a "reduction variable" (using the term "reduction" somewhat more broadly than usual), with each task computing a local partial result and these partial results being subsequently combined into a final global result.
The pattern is especially effective when:
Implementations of this pattern include the following key elements: defining the set of tasks and scheduling their execution onto UEs; defining and updating the local data structures that hold each task's partial results; and combining (reducing) the local results into a single global result.
When this pattern is used in a parallel algorithm, it usually drives the top-level organization of the parallel algorithm. It frequently appears, as in the N-body example, as the body of a compute-intensive loop within a program. Often, but not necessarily, the reduction step uses one of a small set of standard commutative and associative reduction operators (+, *, logical and, logical or, bitwise and, bitwise or, min, max).
This pattern shows up in many forms.
Before describing the implementation of this pattern, it is useful to remember that what we are essentially doing is isolating the dependencies so that the remaining concurrency becomes embarrassingly parallel.
A set of tasks is represented and scheduled for execution on multiple units of execution (UEs). Usually, the tasks correspond to iterations of a loop. In this case we implement this pattern by splitting the loop between multiple UEs. The key to making algorithms based on this pattern run well is to schedule their execution so the load is balanced between the UEs. The approaches used for this scheduling are the same as those described in the Implementation section of the EmbarrassinglyParallel pattern.
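For instance, with a simple static (block) schedule each UE can compute its own contiguous range of loop iterations from its index. The following is a minimal sketch; the names my_range, num_ues, and my_id are illustrative and not part of any particular library.

/* Split the iterations [0, n) into one contiguous block per UE.
   my_id is this UE's index in [0, num_ues). */
void my_range(int n, int num_ues, int my_id, int *begin, int *end)
{
    int chunk = n / num_ues;
    int extra = n % num_ues;   /* the first 'extra' UEs get one more iteration */

    *begin = my_id * chunk + (my_id < extra ? my_id : extra);
    *end   = *begin + chunk + (my_id < extra ? 1 : 0);
}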
As discussed previously, the tasks cooperatively update one or more objects, and this pattern is not applicable if values written to such an object by one task are subsequently read by another. Once these objects have been identified, local copies must be created and initialized. It is simplest to think of each task as having its own copy of these shared objects, with their values initialized to the objects' initial values during task initialization and updated as the task performs its work. Often, the local update will in fact be a "local reduction". For example, a task in the N-body problem may handle several bodies; within the task, the local force vector updates would in fact be a reduction of the force contributions for the subset of the bodies handled by that task.
In practice it may be possible to reduce the number of physical copies required, particularly in shared-memory environments, since tasks that do not update a particular object can share a single copy of the object. In a shared-memory environment, an object that is read by multiple tasks but updated by only one need only be duplicated, with one copy maintaining its initial value (to be used by all tasks that do not modify the object) and one copy being updated by the single task that modifies the object.
Also, in practice it is usually enough to have one local copy per unit of execution (UE) rather than one local copy per task.
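A common shared-memory arrangement, sketched below under the assumption that OpenMP is available, is an array with one slot per UE: slot k is the local copy belonging to UE k, initialized to the shared object's initial value and updated only by the tasks running on that UE. The names NUM_UES, partial, and contrib are illustrative.

#include <omp.h>

#define NUM_UES 8

double global_initial = 0.0;   /* initial value of the shared object */
double partial[NUM_UES];       /* one local copy per UE, not per task */

void update_local_copies(const double *contrib, int n)
{
    #pragma omp parallel num_threads(NUM_UES)
    {
        int me = omp_get_thread_num();
        int i;

        partial[me] = global_initial;   /* initialize this UE's local copy */

        /* Each iteration (task) updates only the copy belonging to its UE,
           so the updates amount to a "local reduction" within the UE. */
        #pragma omp for
        for (i = 0; i < n; i++)
            partial[me] += contrib[i];
    }
}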
The reduction step occurs after all the tasks have finished updating their local copies of the data structure. The first step is thus to determine that all partial results have been computed. In its most general form, this is almost the same as detecting termination in the EmbarrassinglyParallel pattern. The difference is that the tasks should have finished the update of the global data structure but may not have actually terminated, since they may still need to participate in the reduction operation.
The reduction step itself performs the calculation indicated earlier (result = combine(v0, v1, ..., vM-1)). In most cases the combine function can be expressed in the form (v0 · v1 · ... · vM-1), where · is a binary operator. Computations of this form are sufficiently common in parallel algorithms that they are discussed separately in the Reduction pattern; many programming environments also supply high-quality implementations of this pattern.
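In its simplest form, the combine step is just a loop over the per-UE copies applying the binary operator (here +), as in the sketch below; in message-passing environments a single collective call such as MPI's MPI_Reduce typically performs the same computation.

/* Combine the per-UE partial results with a binary operator (here +). */
double combine_partials(const double *partial, int num_ues)
{
    double result = partial[0];
    int k;

    for (k = 1; k < num_ues; k++)
        result = result + partial[k];
    return result;
}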
This example, taken from the MPI reference manual [Snir96], uses this pattern to compute the product of a vector and a matrix. A simple sequential program to accomplish this result is as follows:
SUBROUTINE BLAS2(M,N,a,b,c)
  REAL a(M), b(M,N)   !input vector and matrix
  REAL c(N)           !result
  INTEGER M, N, i, j
  DO j = 1, N
    c(j) = 0.0
    DO i = 1, M
      c(j) = c(j) + a(i)*b(i,j)
    END DO
  END DO
  RETURN
END
Each element of the result vector is a reduction of M partial sums, so the SeparableDependencies pattern applies -- we can calculate each element by computing and then combining partial sums, with the decomposition based on partitioning the input vector and matrix as shown in the following figure (shaded areas indicate data for one task).
To calculate the matrix-vector product, each UE computes a local sum (the product of its section of vector a and its section of matrix b); final values for the elements of product vector c are obtained by summing these local sums.
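A message-passing sketch of this decomposition follows. It assumes MPI and assumes each process already holds its own section a_local (of length m_local) of the vector and the corresponding rows b_local of the matrix, stored row-major; the variable names are illustrative. Each process computes a full-length partial result vector, and MPI_Reduce sums the partial vectors into the final result on one process.

#include <stdlib.h>
#include <mpi.h>

/* c (length n) receives the final result on rank 0. */
void matvec(const double *a_local, const double *b_local,
            int m_local, int n, double *c)
{
    double *c_local = malloc(n * sizeof(double));
    int i, j;

    /* Local partial sums: this process's section of a times its rows of b. */
    for (j = 0; j < n; j++) {
        c_local[j] = 0.0;
        for (i = 0; i < m_local; i++)
            c_local[j] += a_local[i] * b_local[i * n + j];
    }

    /* Reduction step: sum the per-process partial vectors element by element. */
    MPI_Reduce(c_local, c, n, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    free(c_local);
}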
This example performs numerical integration using the midpoint rule. The example is almost trivial, but it effectively addresses most of the key issues raised by this pattern. The goal here is to compute pi by integrating the function 4/(1 + x*x) over the interval from 0 to 1. But what is important here is not the mathematics but the pattern, which consists of computing a sum of individual contributions and which is parallelized by computing a number of partial sums and then combining them to get the global result.
First, here's a simple sequential version of a program that solves this problem:
#include <stdio.h>

static long num_steps = 100000;

int main(void)
{
    int i;
    double step, x, pi, sum = 0.0;

    step = 1.0 / (double) num_steps;
    for (i = 1; i <= num_steps; i++) {
        x = (i - 0.5) * step;            /* midpoint of the i-th subinterval */
        sum = sum + 4.0 / (1.0 + x*x);
    }
    pi = sum * step;
    printf(" pi is %f \n", pi);
    return 0;
}
We present two parallel implementations.
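One way to realize a shared-memory version, sketched below assuming an OpenMP compiler, is to let the reduction clause give each thread a private partial sum and combine the per-thread sums when the loop completes.

#include <stdio.h>

static long num_steps = 100000;

int main(void)
{
    long i;
    double pi, sum = 0.0;
    double step = 1.0 / (double) num_steps;

    /* Each thread accumulates a private copy of sum; the reduction clause
       combines the per-thread partial sums at the end of the loop. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 1; i <= num_steps; i++) {
        double x = (i - 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }

    pi = sum * step;
    printf(" pi is %f \n", pi);
    return 0;
}

A message-passing counterpart, sketched here assuming MPI, makes the same structure explicit: each process computes a local partial sum over its share of the iterations, and MPI_Reduce performs the combine step.

#include <stdio.h>
#include <mpi.h>

static long num_steps = 100000;

int main(int argc, char *argv[])
{
    int rank, size;
    long i;
    double step, x, local_sum = 0.0, sum, pi;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    step = 1.0 / (double) num_steps;

    /* Cyclic split of the iterations: rank r handles i = r+1, r+1+size, ... */
    for (i = rank + 1; i <= num_steps; i += size) {
        x = (i - 0.5) * step;
        local_sum += 4.0 / (1.0 + x * x);
    }

    /* Reduction step: combine the per-process partial sums on rank 0. */
    MPI_Reduce(&local_sum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        pi = sum * step;
        printf(" pi is %f \n", pi);
    }

    MPI_Finalize();
    return 0;
}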
This pattern has been used extensively in computational chemistry applications. The parallel computational chemistry book [Mattson95b] includes several chapters that discuss the use of this pattern.
Data-parallel algorithms make heavy use of the SeparableDependencies pattern. While traditionally intended for SIMD computers, data-parallel algorithms can be implemented on other architectures as well. Several data-parallel algorithms, including some that would be suitable for reduction with associative operators, sorting algorithms, and algorithms on linked lists, are described in [Hillis86].
There are also a number of very interesting sorting and other algorithms in which the "split" phase of the pattern is non-trivial. Such algorithms can be found in [JaJa92].
This pattern converts into the EmbarrassinglyParallel pattern when the dependencies are removed.
If the dependencies involve reads as well as writes, the pattern is transformed into the ProtectedDependencies pattern.