Next: Computational Experiments and Results
Up: Implementation of Dynamic Load
Previous: Synchronous Diffusive Load Balancing
In this third load balancing approach, a local communication
strategy and diffusive
work-adjusting scheme are used, as in SDLB. However, a major
difference here is the use
of an asynchronous nonblocking communication scheme, one of the
key capabilities of MPI.
The combination of asynchronous communication functionality
and nonblocking, persistent communication functionality not only
provides for
cheaper communication operations by eliminating communication idle
states, but also, by breaking process synchronization, makes the sequence
of events in the load balancing scheme flexible by allowing
overlap of communication and computation. As illustrated in
Fig. 6.,
when each processor can perform communication arbitrarily at any time,
and independently of a cooperating
processor, all communication operations can be scattered among
computation, with less time consumed in massage passing.
Figure 6:
ADLB uses an asynchronous, nonblocking communication scheme,
providing more flexibility to each
processor and overlapping communication and computation phases.
 |
In addition to the cheaper and more flexible communication scheme,
we incorporate into the ADLB approach
two new
strategies to try to reduce the demand for communication and thereby
try to achieve a higher
overall performance.
First, as noted above, in BP and BB methods, it is not
really necessary to maintain a completely balanced workload
across processors. The actual goal is to prevent the occurrence of
idle states by simply maintaining a workload to each processor sufficiently large
to keep it busy with computation.
To achieve balanced workloads may require a
very large number of workload transmissions, resulting in a heavy
communication burden. However, in this case, many of the workload
transmissions may be unnecessary, since in BP and BB each
processor deals with its stack one work unit (box) at a time
sequentially, leaving all other workload simply standing by.
For a processor to avoid an idle state, and thus have a
high efficiency in computation, it is not necessary that its workload be balanced with
other processors, but only that it be able to obtain additional
workload from another processor through communication as it is
approaching an out-of-work state. Thus, we use here a
receiver-initiate scheme to initiate work transfer only when the number of
boxes in a processor's stack is lower than some threshold, which should
be set high enough that
the processor
is not likely to complete the work and become idle during the processing
of the workload request to
its neighboring processors.
As a consequence, we can also implement
a second strategy, which eliminates the periodic state information
exchange and combines the load state information of the requesting processor
with the workload request message to the donor processor.
Upon receiving the request, the donor follows a diffusive work-adjusting
scheme as described above for the SDLB approach,
but with a modification in
the response to the workload adjusting index. Here,
if is positive and/or greater than a threshold, the donor
sends out workload (boxes) to the requesting processor;
otherwise, it responds that there is no extra
workload available. Thus, when approaching idle, a processor sends
out a request for work to all its
cooperating neighbors, and waits for any processor's donation of
work. In case of no work being transferred, it means that the
neighbor processors are also starved for work and are making work requests
to other neighbors.
In this case, the processor will keep requesting work from the same neighbors
until they eventually obtain extra work from remote processors and are able to
donate parts of it. Through such a
diffusive mechanism, heavily loaded processors can propagate
workload to lightly loaded processors with a small communication
expense.
The last step of this load balancing procedure is to detect global termination.
Because the ADLB scheme is asynchronous, the detection of global
termination is a more complex issue than in the synchronous case.
As noted above, a popular and effective technique for dealing with
this issue is
Dijkstra's token algorithm [53,80,81]. This
is the technique used in the ADLB scheme.
In the next section,
we describe tests of the three approaches outlined above for load balancing in
parallel BP and BB.
Next: Computational Experiments and Results
Up: Implementation of Dynamic Load
Previous: Synchronous Diffusive Load Balancing
ChaoYang Gau
2001-03-12
|