Pattern Name: DataSharing
FindingConcurrency Design Space
This pattern addresses the question "Given a way of decomposing
a problem into tasks, how is data shared among the tasks?"
This pattern constitutes the third step in analyzing
dependencies among the tasks of a problem decomposition.
The first and second steps,
addressed in the
GroupTasks and
OrderTasks patterns,
are to group tasks based on constraints among them and
then determine what ordering constraints apply to groups of tasks.
The next step, discussed here, is to analyze how
data is shared among groups of tasks, so that access to shared data
can be managed correctly.
The first two steps of the dependency analysis have focused on
how the original problem's computation was divided into tasks
(the task decomposition).
This step also takes into consideration the associated data
decomposition, that is, the division of the
problem's data into chunks that can be updated independently,
each associated with one or more tasks that handle the update of that
chunk.
Such a chunk of data is sometimes called "task-local" data
(or just "local" data),
since it is tightly coupled to the task(s) responsible for its
update.
It is rare, however, that each task can operate using only its
own local data; data may need to be shared among tasks in many ways.
Two of the most common situations are the following:
- In addition to task-local data, the problem's data decomposition
may define some data that must be shared among tasks; for
example, the tasks may need to cooperatively update a large shared
data structure. Such data cannot be identified with any given task;
it is inherently global to the problem.
This shared data is
modified by multiple tasks and therefore serves as a source of
dependencies between the tasks.
- Data dependencies can also occur when one task needs access to some
portion of another task's local data. The classic example of this
type of data dependency occurs in finite difference methods
parallelized using a data decomposition, where each point in the
problem space is updated using values from nearby points and
therefore updates for one chunk of the decomposition require values
from the boundaries of neighboring chunks (see the sketch following this list).
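As a rough illustration of this second situation (our own sketch, not taken from any particular application), consider a one-dimensional finite difference update written in C with OpenMP; the array names, the problem size NPOINTS, and the three-point averaging stencil are all invented for the example. When the loop iterations are divided into contiguous chunks, one per thread, updating the first and last points of a chunk reads values that belong to the neighboring chunks.

    #include <stdio.h>

    #define NPOINTS 1000

    int main(void)
    {
        static double uold[NPOINTS], unew[NPOINTS];
        for (int i = 0; i < NPOINTS; i++)
            uold[i] = (double) i;

        /* Each UE (thread) updates a contiguous chunk of unew.  Updating the
           first and last points of its chunk requires uold values owned by
           the neighboring chunks -- the shared boundary data described in
           the list above. */
        #pragma omp parallel for schedule(static)
        for (int i = 1; i < NPOINTS - 1; i++)
            unew[i] = (uold[i-1] + uold[i] + uold[i+1]) / 3.0;

        printf("unew[%d] = %g\n", NPOINTS / 2, unew[NPOINTS / 2]);
        return 0;
    }

Because uold is only read and unew only written during the update, no locking is needed; the sharing shows up instead as the need for boundary values from neighbors (on distributed-memory systems, as communication of "ghost" cells). The sketch assumes a compiler with OpenMP enabled (for example, cc -fopenmp).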
This pattern discusses data sharing in parallel algorithms
and how to deal with typical forms of shared data.
Use this pattern when
- You have decomposed the problem in terms of both tasks and
data (task decomposition and data decomposition), have decided
how to combine the tasks into groups
(as discussed in the
GroupTasks pattern), and
have determined what ordering constraints apply among groups
(as discussed in the
OrderTasks pattern).
The goal of this pattern is to identify what data is shared among
groups of tasks and how to manage access to shared data in a way
that is both correct and efficient.
Data sharing can have major implications for both the
correctness and the efficiency of your program:
- If the sharing is
done incorrectly, a task may get invalid data; this happens often
in shared-address-space environments, where a task can read from a
memory location before the write of the expected data has
completed.
- Guaranteeing
that shared data is ready for use can lead to excessive
synchronization overhead. For example, you can in many cases force
a desired order by putting barrier operations before reads of
shared data, but this can be very inefficient if many
units of execution
(UEs) wait at a barrier that is only needed to properly
synchronize execution of a few UEs (a small sketch of the barrier
approach follows this discussion). A much better strategy is to copy
shared data into local data, to restructure the tasks, or to do both,
so as to minimize the number of times shared data must be read.
- Another source of data-sharing overhead is communication. In
some parallel systems, any access to shared data implies the
passing of a message between units of execution. You can sometimes
avoid this problem by overlapping communication and computation,
but this isn't always possible. Frequently, a better choice is to
structure your algorithm and the way you define tasks so that the
amount of shared data to communicate is minimized. Another approach
is to give each unit of execution its own copy of the shared data;
this requires some care to be sure that the copies are kept
consistent in value but can be more efficient.
The goal, therefore, is to manage shared data enough to ensure
correctness but not so much as to interfere with efficiency.
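To make the synchronization trade-off concrete, here is a small OpenMP sketch of the barrier approach described above; it is only an illustration, and the shared array table[] and its size N are invented for the example. One UE fills the shared data, and a barrier guarantees that the data is ready before any UE reads it, at the cost of making every UE wait.

    #include <stdio.h>
    #include <omp.h>

    #define N 8
    double table[N];    /* shared data: written by one UE, read by the others */

    int main(void)
    {
        #pragma omp parallel
        {
            if (omp_get_thread_num() == 0)      /* one UE produces the shared data */
                for (int i = 0; i < N; i++)
                    table[i] = (double) (i * i);

            /* Every UE waits here before the reads below, even a UE that never
               touches table[]; copying the needed values into local data, or
               restructuring the tasks, can avoid this global wait. */
            #pragma omp barrier

            #pragma omp for
            for (int i = 0; i < N; i++)
                table[i] = 2.0 * table[i];      /* now safe to read the shared data */
        }
        printf("table[%d] = %g\n", N - 1, table[N - 1]);
        return 0;
    }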
We suggest the following approach to determining what data is
shared and how to manage it:
- The first step is to identify data that is shared between tasks.
The data sharing implied by your algorithm is closely connected to
the basic way you
decomposed
your problem.
This is most obvious when the decomposition is
predominantly a data-based decomposition.
For example, in a finite difference problem, the basic
data is decomposed into blocks. The nature of the decomposition
dictates that the data at the edges of the blocks is shared between
neighboring blocks. In essence, you worked out the data sharing
when the basic decomposition was done.
In a decomposition that is predominantly task-based,
the situation is more complex. At some point in the definition of
tasks, you needed to define how data passed into or out of the task
and whether any data was updated in the body of the task. These are
your sources of potential data sharing.
- Once you have identified any data that is shared, you need to
understand how the data will be used. Shared data falls into one of
the following three categories:
- Read-only. The data is read but not written. Since it
is not modified, access to these values does not need to be
protected. On some distributed-memory systems, it is worthwhile to
replicate the read-only data so each unit of execution has its own copy.
- Effectively-local. The data is partitioned into
subsets, each of which is accessed (for read or write) by only one
of the tasks. (An example of this would be an array shared among
tasks in such a way that its elements are effectively partitioned
into sets of task-local data.) This case gives the programmer some
options. If the subsets can be accessed independently (as would
normally be the case with, say, array elements, but not necessarily
with list elements), then the programmer need not worry about
protecting access to this data. On distributed-memory systems, such
data would usually be distributed among UEs, with each UE having
only the data needed by its tasks. If necessary, the data can be
recombined into a single data structure at the end of the
computation.
- Read/write. The data is both read and written and is
accessed by more than one task. This is the general case, and
includes arbitrarily complicated situations in which data is read
from and written to by any number of tasks. It is the most
difficult to deal with, since any access to the data (read or
write) must be protected with some type of exclusive-access
mechanism (locks, semaphores, etc.), which can be very
expensive.
Two special cases of read/write data are common enough to
deserve special mention:
- Accumulate. The data is being used to accumulate a
result (i.e., is being used to compute a reduction). For each
location in the shared data, the values are updated by multiple
tasks, with the update taking place through some sort of
associative accumulation operation. The most common cases for the
accumulation operations are sum, minimum, and maximum, but any
associative pairwise operation will do. For such data, each task
(or, usually, each UE) has a separate copy; the accumulations occur
into these local copies, which are then accumulated into a single
global copy as a final step at the end of the computation (see the
sketch following this list).
- Multiple-read/single-write. The data is read by
multiple tasks (all of which need its initial value) but modified
by only one task (which can read and write its value arbitrarily
often). Such variables occur frequently in algorithms based on data
decompositions. For data of this type, at least two copies are
needed, one to preserve the initial value and one to be used by the
modifying task; the copy containing the initial value can be
discarded at the end of the computation. On distributed-memory
systems, typically a copy is created for each task needing access
(read or write) to the data.
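The accumulate case in particular lends itself to a short sketch. The following C/OpenMP fragment (our own example; the array x[] and the variable global_sum are invented) sums the elements of a shared array: each UE accumulates into its own local copy, and the local copies are combined into the single global copy as a final step, exactly as described above.

    #include <stdio.h>

    #define N 1000
    double x[N];                /* shared, read-only input data */
    double global_sum = 0.0;    /* the shared "accumulate" data */

    int main(void)
    {
        for (int i = 0; i < N; i++)
            x[i] = 1.0;

        #pragma omp parallel
        {
            double local_sum = 0.0;         /* this UE's private copy */

            #pragma omp for
            for (int i = 0; i < N; i++)
                local_sum += x[i];          /* accumulate locally, no protection needed */

            #pragma omp critical            /* combine the copies once, at the end */
            global_sum += local_sum;
        }

        printf("sum = %g\n", global_sum);   /* prints 1000 */
        return 0;
    }

(OpenMP's reduction clause packages exactly this idiom; the explicit form is shown here to make clear where the local copies live and when they are combined.)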
Molecular dynamics.
In the
Examples section of the
DependencyAnalysis pattern we described the problem of
designing a parallel molecular dynamics program. We then identified
the task groups (in the
GroupTasks
pattern) and considered temporal constraints between the task
groups (in the
OrderTasks
pattern). We
will ignore the temporal constraints for now and just focus on data
sharing for the problem's final task groups:
- The group of tasks to find the "bonded forces"
(vibrational forces and rotational forces) on each atom.
- The group of tasks to find the long-range forces on each atom.
- The group of tasks to update the position of each atom.
- The task to update the neighbor list for all the atoms (which
trivially constitutes a task group).
When you analyze the computations taking place within each of
these groups, you find the following shared data:
- The atomic coordinates, used by each group.
These coordinates are treated as read-only data by the
bonded force group, the long-range force group, and the
neighbor-list update group. This data is read/write for the
position update group. Fortunately, the position update group
executes alone after the other three groups are done (based on
the ordering constraints developed using the
OrderTasks pattern). Hence, in the
first three groups we can leave accesses to the position data
unprotected or even replicate it. For the position update group,
the position data belongs to the read/write category, and access to
this data will need to be controlled carefully.
- The force array, used by each group except for the
neighbor-list update.
This array is used as read-only data by the position update
group and as accumulate data for the bonded and long-range force
groups. Since the position update group must follow the force
computations (as determined using the
OrderTasks pattern),
we can put this array in the accumulate category for
the force groups and in the read-only category for the position
update group (a distributed-memory sketch of this accumulate strategy
appears at the end of this section).
- The neighbor list, shared between the long-range force group
and the neighbor-list update group.
This list is effectively-local data for the neighbor-list update
group and read-only data for the long-range force computation.
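To suggest how this analysis might be carried onto a distributed-memory system, the following MPI sketch treats the force array as accumulate data and the coordinates as replicated read-only data, as discussed above. It is only an illustration, not the molecular dynamics code itself: the atom count NATOMS, the cyclic distribution of atoms over UEs, and the placeholder "force law" are all invented.

    #include <stdio.h>
    #include <mpi.h>

    #define NATOMS 64

    int main(int argc, char **argv)
    {
        double coords[NATOMS];                /* read-only here: replicated on every UE */
        double local_force[NATOMS] = {0.0};   /* this UE's copy of the accumulate data  */
        double force[NATOMS];                 /* the combined (global) force array      */
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (int i = 0; i < NATOMS; i++)
            coords[i] = (double) i;

        /* Each UE computes force contributions for its share of the atoms,
           accumulating into its local copy only (the expression below is a
           placeholder for the real force computation). */
        for (int i = rank; i < NATOMS; i += size)
            local_force[i] += 0.1 * coords[i];

        /* Final step: combine the per-UE copies into one global force array. */
        MPI_Allreduce(local_force, force, NATOMS, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("force[1] = %g\n", force[1]);

        MPI_Finalize();
        return 0;
    }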