Pattern Name: DataSharing
FindingConcurrency Design Space
This pattern addresses the question "Given a way of decomposing
a problem into tasks, how is data shared among the tasks?"
This pattern constitutes the third step in analyzing
dependencies among the tasks of a problem decomposition.
The first and second steps,
addressed in the
GroupTasks and
OrderTasks patterns,
are to group tasks based on constraints among them and
then determine what ordering constraints apply to groups of tasks.
The next step, discussed here, is to analyze how
data is shared among groups of tasks, so that access to shared data
can be managed correctly.
The first two steps of the dependency analysis have focused on
how the original problem's computation was divided into tasks
(the task decomposition).
This step also takes into consideration the associated data
decomposition, that is, the division of the
problem's data into chunks that can be updated independently,
each associated with one or more tasks that handle the update of that
chunk.
Such a chunk of data is sometimes called "task-local" data
(or just "local" data),
since it is tightly coupled to the task(s) responsible for its
update.
It is rare, however, that each task can operate using only its
own local data; data may need to be shared among tasks in many ways.
Two of the most common situations are the following:
- In addition to task-local data, the problem's data decomposition
may define some data that must be shared among tasks; for
example, the tasks may need to cooperatively update a large shared
data structure. Such data cannot be identified with any given task;
it is inherently global to the problem.
This shared data is
modified by multiple tasks and therefore serves as a source of
dependencies between the tasks.
- Data dependencies can also occur when one task needs access to some
portion of another task's local data. The classic example of this
type of data dependency occurs in finite difference methods
parallelized using a data decomposition, where each point in the
problem space is updated using values from nearby points and
therefore updates for one chunk of the decomposition require values
from the boundaries of neighboring chunks (see the sketch following this list).
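As a rough illustration of this second situation (our own sketch, not taken from any particular application), consider a one-dimensional finite difference update written in C with OpenMP; the array names, the problem size NPOINTS, and the three-point averaging stencil are all invented for the example. When the loop iterations are divided into contiguous chunks, one per thread, updating the first and last points of a chunk reads values that belong to the neighboring chunks.

    #include <stdio.h>

    #define NPOINTS 1000

    int main(void)
    {
        static double uold[NPOINTS], unew[NPOINTS];
        for (int i = 0; i < NPOINTS; i++)
            uold[i] = (double) i;

        /* Each UE (thread) updates a contiguous chunk of unew.  Updating the
           first and last points of its chunk requires uold values owned by
           the neighboring chunks -- the shared boundary data described in
           the list above. */
        #pragma omp parallel for schedule(static)
        for (int i = 1; i < NPOINTS - 1; i++)
            unew[i] = (uold[i-1] + uold[i] + uold[i+1]) / 3.0;

        printf("unew[%d] = %g\n", NPOINTS / 2, unew[NPOINTS / 2]);
        return 0;
    }

Because uold is only read and unew only written during the update, no locking is needed; the sharing shows up instead as the need for boundary values from neighbors (on distributed-memory systems, as communication of "ghost" cells). The sketch assumes a compiler with OpenMP enabled (for example, cc -fopenmp).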
This pattern discusses data sharing in parallel algorithms
and how to deal with typical forms of shared data.
Use this pattern when
- You have decomposed the problem in terms of both tasks and
data (task decomposition and data decomposition), have decided
how to combine the tasks into groups
(as discussed in the
GroupTasks pattern), and
have determined what ordering constraints apply among groups
(as discussed in the
OrderTasks pattern).
The goal of this pattern is to identify what data is shared among
groups of tasks and how to manage access to shared data in a way
that is both correct and efficient.
Data sharing can have major implications for both the
correctness and the efficiency of your program:
- If the sharing is
done incorrectly, a task may get invalid data; this happens often
in shared-address-space environments, where a task can read from a
memory location before the write of the expected data has
completed.
- Guaranteeing
that shared data is ready for use can lead to excessive
synchronization overhead. For example, you can in many cases force
a desired order by putting barrier operations before reads of
shared data, but this can be very inefficient if many
units of execution
(UEs) wait at a barrier that is only needed to properly
synchronize execution of a few UEs (a small sketch of the barrier
approach follows this discussion). A much better strategy is to copy
shared data into local data, to restructure the tasks, or to do both,
so as to minimize the number of times shared data must be read.
- Another source of data-sharing overhead is communication. In
some parallel systems, any access to shared data implies the
passing of a message between units of execution. You can sometimes
avoid this problem by overlapping communication and computation,
but this isn't always possible. Frequently, a better choice is to
structure your algorithm and the way you define tasks so that the
amount of shared data to communicate is minimized. Another approach
is to give each unit of execution its own copy of the shared data;
this requires some care to be sure that the copies are kept
consistent in value but can be more efficient.
The goal, therefore, is to manage shared data enough to ensure
correctness but not so much as to interfere with efficiency.
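To make the synchronization trade-off concrete, here is a small OpenMP sketch of the barrier approach described above; it is only an illustration, and the shared array table[] and its size N are invented for the example. One UE fills the shared data, and a barrier guarantees that the data is ready before any UE reads it, at the cost of making every UE wait.

    #include <stdio.h>
    #include <omp.h>

    #define N 8
    double table[N];    /* shared data: written by one UE, read by the others */

    int main(void)
    {
        #pragma omp parallel
        {
            if (omp_get_thread_num() == 0)      /* one UE produces the shared data */
                for (int i = 0; i < N; i++)
                    table[i] = (double) (i * i);

            /* Every UE waits here before the reads below, even a UE that never
               touches table[]; copying the needed values into local data, or
               restructuring the tasks, can avoid this global wait. */
            #pragma omp barrier

            #pragma omp for
            for (int i = 0; i < N; i++)
                table[i] = 2.0 * table[i];      /* now safe to read the shared data */
        }
        printf("table[%d] = %g\n", N - 1, table[N - 1]);
        return 0;
    }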
We suggest the following approach to determining what data is
shared and how to manage it:
- The first step is to identify data that is shared between tasks.
The data sharing implied by your algorithm is closely connected to
the basic way you
decomposed
your problem.
This is most obvious when the decomposition is
predominantly a data-based decomposition.
For example, in a finite difference problem, the basic
data is decomposed into blocks. The nature of the decomposition
dictates that the data at the edges of the blocks is shared between
neighboring blocks. In essence, you worked out the data sharing
when the basic decomposition was done.
In a decomposition that is predominantly task-based,
the situation is more complex. At some point in the definition of
tasks, you needed to define how data passed into or out of the task
and whether any data was updated in the body of the task. These are
your sources of potential data sharing.
- Once you have identified any data that is shared, you need to
understand how the data will be used. Shared data falls into one of
the following three categories:
- Read-only. The data is read but not written. Since it
is not modified, access to these values does not need to be
protected. On some distributed-memory systems, it is worthwhile to
replicate the read-only data so each unit of execution has its own copy.
- Effectively-local. The data is partitioned into
subsets, each of which is accessed (for read or write) by only one
of the tasks. (An example of this would be an array shared among
tasks in such a way that its elements are effectively partitioned
into sets of task-local data.) This case gives the programmer some
options. If the subsets can be accessed independently (as would
normally be the case with, say, array elements, but not necessarily
with list elements), then the programmer need not worry about
protecting access to this data. On distributed-memory systems, such
data would usually be distributed among UEs, with each UE having
only the data needed by its tasks. If necessary, the data can be
recombined into a single data structure at the end of the
computation.
- Read/write. The data is both read and written and is
accessed by more than one task. This is the general case, and
includes arbitrarily complicated situations in which data is read
from and written to by any number of tasks. It is the most
difficult to deal with, since any access to the data (read or
write) must be protected with some type of exclusive-access
mechanism (locks, semaphores, etc.), which can be very
expensive.
Two special cases of read/write data are common enough to
deserve special mention:
- Accumulate. The data is being used to accumulate a
result (i.e., is being used to compute a reduction). For each
location in the shared data, the values are updated by multiple
tasks, with the update taking place through some sort of
associative accumulation operation. The most common cases for the
accumulation operations are sum, minimum, and maximum, but any
associative pairwise operation will do. For such data, each task
(or, usually, each UE) has a separate copy; the accumulations occur
into these local copies, which are then accumulated into a single
global copy as a final step at the end of the computation (see the
sketch following this list).
- Multiple-read/single-write. The data is read by
multiple tasks (all of which need its initial value) but modified
by only one task (which can read and write its value arbitrarily
often). Such variables occur frequently in algorithms based on data
decompositions. For data of this type, at least two copies are
needed, one to preserve the initial value and one to be used by the
modifying task; the copy containing the initial value can be
discarded at the end of the computation. On distributed-memory
systems, typically a copy is created for each task needing access
(read or write) to the data.
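The accumulate case in particular lends itself to a short sketch. The following C/OpenMP fragment (our own example; the array x[] and the variable global_sum are invented) sums the elements of a shared array: each UE accumulates into its own local copy, and the local copies are combined into the single global copy as a final step, exactly as described above.

    #include <stdio.h>

    #define N 1000
    double x[N];                /* shared, read-only input data */
    double global_sum = 0.0;    /* the shared "accumulate" data */

    int main(void)
    {
        for (int i = 0; i < N; i++)
            x[i] = 1.0;

        #pragma omp parallel
        {
            double local_sum = 0.0;         /* this UE's private copy */

            #pragma omp for
            for (int i = 0; i < N; i++)
                local_sum += x[i];          /* accumulate locally, no protection needed */

            #pragma omp critical            /* combine the copies once, at the end */
            global_sum += local_sum;
        }

        printf("sum = %g\n", global_sum);   /* prints 1000 */
        return 0;
    }

(OpenMP's reduction clause packages exactly this idiom; the explicit form is shown here to make clear where the local copies live and when they are combined.)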
Molecular dynamics.
In the
Examples section of the
DependencyAnalysis pattern we described the problem of
designing a parallel molecular dynamics program. We then identified
the task groups (in the
GroupTasks
pattern) and considered temporal constraints between the task
groups (in the
OrderTasks
pattern). We
will ignore the temporal constraints for now and just focus on data
sharing for the problem's final task groups:
- The group of tasks to find the "bonded forces"
(vibrational forces and rotational forces) on each atom.
- The group of tasks to find the long-range forces on each atom.
- The group of tasks to update the position of each atom.
- The task to update the neighbor list for all the atoms (which
trivially constitutes a task group).
When you analyze the computations taking place within each of
these groups, you find the following shared data:
- The atomic coordinates, used by each group.
These coordinates are treated as read-only data by the
bonded force group, the long-range force group, and the
neighbor-list update group. This data is read/write for the
position update group. Fortunately, the position update group
executes alone after the other three groups are done (based on
the ordering constraints developed using the
OrderTasks pattern). Hence, in the
first three groups we can leave accesses to the position data
unprotected or even replicate it. For the position update group,
the position data belongs to the read/write category, and access to
this data will need to be controlled carefully.
- The force array, used by each group except for the
neighbor-list update.
This array is used as read-only data by the position update
group and as accumulate data for the bonded and long-range force
groups. Since the position update group must follow the force
computations (as determined using the
OrderTasks pattern),
we can put this array in the accumulate category for
the force groups and in the read-only category for the position
update group (a distributed-memory sketch of this accumulate strategy
appears at the end of this section).
- The neighbor list, shared between the long-range force group
and the neighbor-list update group.
This list is effectively-local data for the neighbor-list update
group and read-only data for the long-range force computation.
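To suggest how this analysis might be carried onto a distributed-memory system, the following MPI sketch treats the force array as accumulate data and the coordinates as replicated read-only data, as discussed above. It is only an illustration, not the molecular dynamics code itself: the atom count NATOMS, the cyclic distribution of atoms over UEs, and the placeholder "force law" are all invented.

    #include <stdio.h>
    #include <mpi.h>

    #define NATOMS 64

    int main(int argc, char **argv)
    {
        double coords[NATOMS];                /* read-only here: replicated on every UE */
        double local_force[NATOMS] = {0.0};   /* this UE's copy of the accumulate data  */
        double force[NATOMS];                 /* the combined (global) force array      */
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (int i = 0; i < NATOMS; i++)
            coords[i] = (double) i;

        /* Each UE computes force contributions for its share of the atoms,
           accumulating into its local copy only (the expression below is a
           placeholder for the real force computation). */
        for (int i = rank; i < NATOMS; i += size)
            local_force[i] += 0.1 * coords[i];

        /* Final step: combine the per-UE copies into one global force array. */
        MPI_Allreduce(local_force, force, NATOMS, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("force[1] = %g\n", force[1]);

        MPI_Finalize();
        return 0;
    }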