Our design goal is to make parallelization transparent and hide it in a few objects while achieving high performance. We introduce a transparent and incremental parallelization scheme for clusters with moderate number of nodes, e.g. up to 64 CPU's. The design is based on a force decomposition with a master-slave paradigm and a nearly optimal spatial alignment of the atoms. We introduce an object that parallelizes a given computational work by distributing the work and collecting the contributions. Work distribution is performed by the master(s), through a short message (work-command). Slaves retrieve the required data independently, carry out the work, and notify the master when finished, see Fig. 2a. The key to achieving good performance is the initial alignment of data, and the use of one-sided communication, as explained below.
Data may either be centralized, distributed or replicated, depending on the parallelization implementation desired. To distribute data among the slaves we use proxies. The slaves exchange data for the assigned work independently (Fig. 2b) and without any explicit synchronization via proxies; the owner of data makes it globally accessible. Access of data is transparent, as in the sequential case. Caching and a smart reduction of contributions are also hidden in the proxies. Communication is based on MPI and one-sided communication features of MPI-2 [106,113].
The first parallelization of PROTOMOL will be a centralized data implementation, where the master keeps all data. For some forces we do not expect the same performance as for a spatial decomposition, but we have the possibility to rearrange the data to get a nearly spatial decomposition. Thus, later we will do a distributed implementation, where data may also migrate from one node to another to reduce communication and eliminate data replication. We would also like to connect several PROTOMOL's via the internet, each computing one large molecule to compute more complex systems or aggregation phenomena.
Load balancing becomes an issue for large number of nodes [82,114]. Load balancing can be done either statically or dynamically. For MD one needs dynamic load balance: this may use a global or local scheme. Our load balancing is performed by master(s). For massively parallel processors (MPPs), the same overall design applies, but the details of the data management change drastically. In particular, the load balancing strategy would use a hierarchy of or several masters to avoid bottlenecks. We are currently exploring several of these issues, and preliminary results are presented in Section 4.4.
To compute the electrostatic interactions there are known methods
using fast multipole methods (FMM) requiring O(N) operations,
particle-particle particle-mesh requiring
,
and
multilevel methods. Our computation of electrostatic forces is based
on a multilevel summation method. The force is split into rapidly
varying ``fast'' and smooth parts. ``Fast'' forces from particles are
computed using a cutoff. Correction is performed on a multilevel
grid, coarsening the grid until the system size is smaller than the
cutoff. Our sequential tests of a multilevel summation for isolated
systems have shown it to be as efficient as FMM. Similar results are
reported in [137]. A description of our new
algorithms for Ewald-like summations is given in
Section 2.4.
For the electrostatic interactions on a given grid level the
distribution must be balanced and communication minimized. Thus, our
communication interface should be lightweight: we will use one-sided
communication to achieve this. For the coarser grids (
)
the communication costs dominate. Here, we will compute
several coarse grids to achieve higher accuracy. This is a better alternative than just having some nodes stand idle.
We assume there is moderate fluctuation of the atomic positions and a nearly spatial alignment of the atoms. We are not interested in very large rapidly changing systems that typically occur in material science [16,96,153]. Given these assumptions, we think that the master-slave approach will scale well. To avoid bottlenecks at the master, only few small messages are sent to assign work and to notify work completion. The slave retrieves data using MPI-2 one-sided communication, where no synchronization with the master is required, avoiding considerable idle time. Furthermore, we get a very simple implementation. This concept has been successfully tested and evaluated by Thierry Matthey, a student I am co-advising [106].
|