Pattern Name: DataDecomposition |
FindingConcurrency Design Space |
This pattern addresses the question "How do you decompose a problem's data into units that can be operated on relatively independently?"
This pattern looks at the issues involved in decomposing data into units that can be updated concurrently.
For example, most linear algebra problems update large matrices, applying a similar set of operations to each element of the matrix. In these cases, it is straightforward to drive the parallel algorithm design by looking at how the matrix can be broken up into blocks that are updated concurrently. The task definitions then follow from how the blocks are defined and mapped onto the processing elements of the parallel computer.
Use this pattern when
Compilers are good at analyzing data dependencies and can in some cases automatically deduce a data decomposition. In most cases, however, you have to carry out the decomposition by hand.
If you have already carried out a task-based decomposition, the data decomposition is driven by the needs of each task. If well-defined and distinct data can be associated with each task, the decomposition should be simple.
If you are starting with a data decomposition, however, you need to look not at the tasks but at the central data structures defining your problem, considering whether they can they be broken down into chunks that can be operated on concurrently. A few common examples are
Regardless of the nature of the underlying data structure, the decomposition of the data serves as the organizing principle of your parallel algorithm.
As you consider how to decompose the problem's data structures, keep in mind the competing forces mentioned in the DecompositionStrategy pattern:
If at all possible, you should define chunks whose size and number are controlled by a small number of parameters. These parameters define so-called "granularity knobs" that you can vary to modify the data decomposition to match the needs of the underlying hardware. (Note, however, that many designs are not infinitely adaptable with respect to granularity.)
The easiest place to see the impact of granularity on the data decomposition is in the overhead required to manage dependencies between chunks. The time required to manage dependencies must be small compared to the overall runtime. In a good data decomposition, the dependencies scale at a lower dimension than the computational effort associated with each chunk. For example, in many finite difference codes, the half-width of the finite difference stencil defines a region of cells along the surface of each decomposed chunk. This surface region defines the dependency between chunks. The size of the set of dependent cells scales as the surface area, while the effort required in the computation scales as the volume of the chunk. This means that you can scale the computational effort (based on the chunk's volume) to offset overheads associated with data dependencies (based on the surface area of the chunk).
Once you have your data decomposed, you need to look at the task decomposition implied by the tasks. The TaskDecomposition pattern may help you with this analysis.
Consider the medical imaging problem described earlier (in the Examples section of the DecompositionStrategy pattern). In this application, a point inside a model of the body is selected randomly, a radioactive decay is allowed to occur at this point, and the trajectory of the emitted particle is followed. To create a statistically significant simulation, thousands if not millions of trajectories are followed.
A task-based decomposition is a natural choice for this problem. Memory constraints, however, have motivated the development of data-based decompositions for this problem. When the memory of the underlying parallel hardware is distributed, it is advantageous to avoid replicating the huge body model on each processing element.
In a data-based decomposition, the body model is the large central data structure around which the computation can be organized. The model is broken into segments, and one or more segments are associated with each processing element. The body segments are only read, not written, during the trajectory computations, so there are no data dependencies created by the decomposition of the body model.
Once the data has been decomposed, you need to look at the tasks associated with each data segment. In this case, each trajectory passing through the data segment defines a task. The trajectories are initiated and propagated within a segment exactly as for the task-based approach. The difference occurs when a segment boundary is encountered. When this happens, the trajectory must be passed between segments. It is this transfer that defines the dependencies between data chunks.
Notice that this algorithm is more complex than one based on a task-based decomposition. Considerable effort can be required to implement the bookkeeping required to keep track of the set of trajectories as they move through the body model.
Consider the standard multiplication of two matrices (C = A · B). In the Examples section of the TaskDecomposition pattern we discussed a task-based decomposition suitable for shared-memory environments but less so for distributed-memory environments. Several data-based decompositions are possible for this problem. A straightforward one would be to assign a row of the product matrix C to each processing element. From the definition of matrix multiplication, that means that each processing element would need the full A matrix, but only the corresponding row of B. With such a data decomposition, the basic task in our algorithm becomes the computation of a row of C. This still requires the replication of too much data (the full A matrix), however, so we might refine our algorithm so that we decompose all three matrices into blocks. The basic task then becomes the update of a C block, with the A and B blocks being cycled among the nodes as the computation proceeds. The result is the data-based decomposition discussed as an example of the GeometricDecomposition pattern. Although such a block decomposition is more complex, it is nevertheless the approach used in practice since it is the most efficient.
Data decompositions are very common in parallel scientific computing. The parallel linear algebra library ScaLAPACK is a good example.