Pattern Name: DesignEvaluation |
FindingConcurrency Design Space |
In this pattern, we evaluate the design so far, and decide whether to revisit the design or move on to the next design space.
The patterns in the FindingConcurrency design space have helped the designer expose the concurrency in his or her problem. In particular, the original problem has been analyzed to produce:
We will use this information in the next design space -- the AlgorithmStructure design space -- to construct an algorithm that can exploit this concurrency in a parallel program.
In some cases, the concurrency is straightforward and there is clearly a best way to decompose a problem to expose it. More commonly, however, there are many ways to decompose a problem into tasks. Choosing one may require tradeoffs between three criteria of a good design: simplicity, flexibility, and efficiency. Unfortunately, there is no foolproof way to be sure that you have defined the right set of tasks or even that the tasks have been correctly grouped. The design process is inherently iterative. This pattern will help you decide evaluate your design and decide whether to move on to the next design space or revisit the decomposition.
Use this pattern when
This pattern has two goals: to evaluate the design so far (with the possible result that the programmer decides to revisit and possible revise decisions made thus far) and to prepare for the next phase of the design process. We therefore describe how to evaluate the design from three perspectives: suitability for the target platform, design quality, and preparation for the next phase of the design.
While it is desirable to delay mapping a program onto a particular target platform as long as possible, the characteristics of the target platform do need to be considered at least minimally while evaluating your design. Below are some issues relevant to the choice of target platform or platforms.
With some exceptions, having many more tasks than processing elements (PEs) makes it easier to keep all the PEs busy. Obviously we can't make use of more PEs than we have tasks, but having only one, or a few, tasks per PE can lead to poor load balance load balance. For example, consider the case of a Monte Carlo simulation in which a calculation is repeated over and over for different sets of randomly chosen data, such that the time taken for the calculation varies considerably depending on the data. A natural approach to developing a parallel algorithm would be to treat each calculation (for a separate set of data) as a task; these tasks are then completely independent and can be scheduled however we like. But since the time for each task can vary considerably, unless there are many more tasks than PEs it will be difficult to achieve good load balance.
The exceptions to this rule are designs in which the number of tasks can be adjusted to fit the number of PEs in such a way that good load balance is maintained. An example of such a design is the block-based matrix multiplication algorithm described in the Examples section of the DataDecomposition pattern: Tasks correspond to blocks, and all the tasks involve roughly the same amount of computation, so adjusting the number of tasks to be equal to the number of PEs produces an algorithm with good load balance. (Note, however, that even in this case it might be advantageous to have more tasks than PEs, if for example that would allow overlap of computation and communication.)
A design that involves large-scale or fine-grained data sharing among tasks will be easier to implement and more efficient if all tasks have access to the same memory. Ease of implementation depends on programming environment; an environment based on shared-memory model (all units of execution share an address space) makes it easier to implement a design requiring extensive data sharing. Efficiency depends also on the target machine; a design involving extensive data-sharing is likely to be more efficient on a symmetric multiprocessor (where access time to memory is uniform across processors) than on a machine that layers a shared-memory environment over physically distributed memory. In contrast, if you plan to implement your design using a message-passing environment running on a distributed-memory architecture, a design involving extensive data sharing is likely not a good choice.
For example, consider the task-based approach to medical imaging problem described in the Examples section of the TaskDecomposition pattern. This design requires that all tasks have read access to a potentially very large data structure (the body model), which presents no problems in a shared-memory environment but in a distributed-memory environment can require prohibitive amounts of memory or communication.
A design that requires fine-grained data-sharing (in which the same data structure is accessed repeatedly by many tasks, particularly when both reads and writes are involved) is also likely to be more efficient on a shared-memory machine, because the overhead required to protect each access is likely to be smaller than for a distributed-memory machine.
The only exceptions to these principles would be problems in which it is easy to group and schedule tasks in such a way that the only large-scale or fine-grained data sharing is among tasks assigned to the same unit of execution.
In essence, this question asks you to revisit the preceding two questions, but in terms of units of execution (UEs) rather than processing elements (PEs). This can be an important distinction to make if the target system supports multiple UEs per PE, particularly if the target system emphasizes the use of multitasking to hide latency (an example of such a system is the Tera machine).
There are two factors to keep in mind when considering whether a design using more than one UE per PE makes sense.
The first factor is whether the target system provides efficient support for multiple UEs per PE. Some systems do provide such support (an example is the Tera machine, which was designed to provide efficient support for many more threads (UEs) than processors (PEs)). Other systems do not (an example is an MPP system with one processor per node and slow context-switching, where multiple processes (UEs) per processor (PE) are likely to be inefficient).
The second factor is whether the design can make good use of multiple UEs per PE. For example, if the design involves coordination operations with high latency, it might be possible to mask that latency by assigning multiple UEs to each PE. If however the design involves coordination operations that are tightly synchronized (e.g., pairs of blocking send/receives) and relatively efficient, assigning multiple UEs to each PE is more likely to interfere with ease of implementation (by requiring extra effort to avoid deadlock) than to improve efficiency.
A critical factor in determining whether a design is effective is the ratio of time spent doing computation to time spent managing data dependencies ("coordination" -- i.e., synchronization or communication among processing elements): The higher the ratio, the more efficient the program. This ratio is affected not only by the number and type of coordination events required by the design but also by the characteristics of the target platform. For example, a message-passing design that is acceptably efficient on an MPP with a fast interconnect network and relatively slow processors will likely be less efficient, perhaps unacceptably so, on an Ethernet-connected network of powerful workstations.
Note also that this critical ratio is also frequently affected by problem size relative to the number of available processing elements, since for a given problem size the time spent by each processor doing computation decreases with the number of processors, while the time spent by each processor doing coordination may stay the same or even increase as the number of processors increases.
Keeping these characteristics of the target platform in mind, we can evaluate the design along the three dimensions of flexibility, efficiency, and simplicity.
You would like your high level design to be adaptable to a variety of different implementation requirements, and certainly all the ones that you care about. The following is a partial checklist of factors that affect flexibility.
For example, a common operation is to transpose a matrix such that a distribution in terms of blocks of matrix columns becomes a distribution in terms of blocks of matrix rows. It's easy to write down the algorithm and code it for square matrices where the matrix order is evenly divided by the number of processing elements. But what if the matrix is not square, or what if the number of rows is much greater than the number of columns and neither number is evenly divided by the number of nodes? This requires significant changes to the transpose algorithm. For a rectangular matrix, for example, the buffer that will hold the matrix block will need to be large enough to hold the larger of the two blocks. If either the row or column dimension of the matrix is not evenly divisible by the number of nodes, then the blocks will not be the same size on each node. Can your algorithm deal with the uneven load that will result from having different block sizes on each node?
You would like your program to effectively utilize the available computing resources. The following is a partial list of important factors to check. Note that typically it is not possible to simultaneously optimize all of these factors; design tradeoffs are inevitable.
This is easier if the tasks are independent, or if they are roughly the same size.
Overhead can come from several sources, including thread creation and scheduling, communication, and synchronization.
Thread creation and scheduling involve overhead, so you should make sure that each thread has enough work to do to justify this overhead. On the other hand, more threads allow for better load balance.
Communication can also be a source of significant overhead, particularly in platforms that use message-passing; message transfer typically involves both overhead due to kernel calls and latency due to the time it takes the message to travel over the network. While network latency can sometimes be hidden by overlapping it with computation, to minimize the overhead due to kernel calls, the number of messages to be sent should be minimized. Even on shared-memory multiprocessors, data should be localized as much as possible.
Synchronization is required whenever a dependency requires one task to wait for another -- either because the result is needed, or to avoid race conditions. Designs that minimize dependencies may reduce synchronization overhead.
To paraphrase Einstein: Make it as simple as possible, but not simpler.
Keep in mind that you will ultimately need to debug any program you write. A design -- even a generally superior design -- will not do you any good if you cannot debug, maintain, and verify the correctness of the final program.
The medical imaging example given in the DecompositionStrategy pattern is an excellent case in point in support of the value of simplicity. In this problem a large database could be decomposed, but this decomposition would force the parallel algorithm to include complex operations for passing a simulation point from one database chunk to another. This complexity makes the resulting program much more difficult to understand and greatly complicates debugging. The other approach, replicating the database, leads to a vastly simpler parallel program in which completely independent tasks can be passed out to multiple workers as they are read. All complex communication thus goes away, and the parallel part of the program is trivial to debug and/or reason about.
The problem decomposition carried out with the FindingConcurrency patterns serves to prepare for the next design space. Many of the key issues that determine a high-quality problem decomposition cannot be described, let alone resolved, until you start working with the AlgorithmStructure patterns. However, here are some key issues to keep in mind as you enter the next design space.
In other words, do they vary widely among themselves? If so, the scheduling of the tasks and their sharing of data may be an important issue. In a regular decomposition, all the tasks are in some sense the same -- roughly the same computation (on different sets of data), roughly the same dependencies on data shared with other tasks, etc. Examples include the various matrix multiplication algorithms described in the Examples sections of the TaskDecomposition and DataDecomposition patterns.
In an irregular decomposition, the work done by each task and/or the data dependencies vary among tasks. For example, consider a discrete-event simulation of a large system consisting of a number of distinct components. We might design a parallel algorithm for this simulation by defining a task for each component and having them interact based on the discrete events of the simulation. This would be a very irregular design in that there would be considerable variation among tasks with regard to work done and dependencies on other tasks.
(Note this question is closely related to that of regular versus irregular.)
In some designs, the interaction between tasks is also very regular with regard to time -- i.e., it is synchronous. For example, a typical approach to parallelizing a linear-algebra problem involving the update of a large matrix is to partition the matrix among tasks and have each task update "its" part of the matrix, using data from both "its" and other parts of the matrix. Assuming that all the data needed for the update is present at the start of the computation, these tasks will typically first exchange information and then compute independently. Another type of example is a pipeline computation, in which we perform a multi-step operation on a sequence of sets of input data by setting up an assembly line of tasks (one for each step of the operation), with data flowing from one task to the next as each task accomplishes its work. This approach works best if all of the tasks stay more or less in step -- i.e., if their interaction is synchronous.
In other designs, the interaction between tasks is not so chronologically regular. An example is the discrete-event simulation described previously (in the discussion of regular versus irregular), in which the events that lead to interaction between tasks may be chronologically irregular.
The temporal relations are easy: Tasks that can run at the same time are naturally grouped together. But an effective design will also group tasks together based on their logical relationship in the overall problem.
As an example of grouping tasks, consider the molecular dynamics problem discussed in the Examples section of the DependencyAnalysis pattern. The grouping we eventually arrive at (in the GroupTasks pattern) is hierarchical: groups of related tasks based on the high-level operations of the problem, further grouped on the basis of which ones can execute concurrently. Such an approach makes it easier to reason about whether the design meets the necessary constraints (since the constraints can be stated in terms of the task groups defined by the high-level operations) while allowing for scheduling flexibility.