next up previous
Next: Collaboratory Front End Up: Protomol Software Infrastructure Previous: Protomol Software Infrastructure

High Performance Parallel Back End

The most computationally expensive part of MD is the force computation between various atoms. To meet this computational demand there are four basic strategies [140] for parallelization. A master-slave approach which allocates work among slaves and has both communication and load balancing difficulties. Second, the atom decomposition which is based on data replication. It is an easy but memory expensive approach with poor scaling due to global communication. Examples of programs that use this decomposition are included in [30,135,152]. Third, the force decomposition which involves either force matrix or systolic loops decompositions. It scales better than atom decomposition by reducing communication costs through the use of a block-decomposition as described in [72,118,121,149] and finally, spatial decomposition, which is the most scalable of these decompositions, but is more difficult to write and extend. It has been used in [16,31,41,81,82,106].

Our design goal is to make parallelization transparent and hide it in a few objects while achieving high performance. We introduce a transparent and incremental parallelization scheme for clusters with moderate number of nodes, e.g. up to 64 CPU's. The design is based on a force decomposition with a master-slave paradigm and a nearly optimal spatial alignment of the atoms. We introduce an object that parallelizes a given computational work by distributing the work and collecting the contributions. Work distribution is performed by the master(s), through a short message (work-command). Slaves retrieve the required data independently, carry out the work, and notify the master when finished, see Fig. 2a. The key to achieving good performance is the initial alignment of data, and the use of one-sided communication, as explained below.

Data may either be centralized, distributed or replicated, depending on the parallelization implementation desired. To distribute data among the slaves we use proxies. The slaves exchange data for the assigned work independently (Fig. 2b) and without any explicit synchronization via proxies; the owner of data makes it globally accessible. Access of data is transparent, as in the sequential case. Caching and a smart reduction of contributions are also hidden in the proxies. Communication is based on MPI and one-sided communication features of MPI-2 [106,113].

The first parallelization of PROTOMOL will be a centralized data implementation, where the master keeps all data. For some forces we do not expect the same performance as for a spatial decomposition, but we have the possibility to rearrange the data to get a nearly spatial decomposition. Thus, later we will do a distributed implementation, where data may also migrate from one node to another to reduce communication and eliminate data replication. We would also like to connect several PROTOMOL's via the internet, each computing one large molecule to compute more complex systems or aggregation phenomena.

Load balancing becomes an issue for large number of nodes [82,114]. Load balancing can be done either statically or dynamically. For MD one needs dynamic load balance: this may use a global or local scheme. Our load balancing is performed by master(s). For massively parallel processors (MPPs), the same overall design applies, but the details of the data management change drastically. In particular, the load balancing strategy would use a hierarchy of or several masters to avoid bottlenecks. We are currently exploring several of these issues, and preliminary results are presented in Section 4.4.

To compute the electrostatic interactions there are known methods using fast multipole methods (FMM) requiring O(N) operations, particle-particle particle-mesh requiring $O(N \log N)$, and multilevel methods. Our computation of electrostatic forces is based on a multilevel summation method. The force is split into rapidly varying ``fast'' and smooth parts. ``Fast'' forces from particles are computed using a cutoff. Correction is performed on a multilevel grid, coarsening the grid until the system size is smaller than the cutoff. Our sequential tests of a multilevel summation for isolated systems have shown it to be as efficient as FMM. Similar results are reported in [137]. A description of our new algorithms for Ewald-like summations is given in Section 2.4. For the electrostatic interactions on a given grid level the distribution must be balanced and communication minimized. Thus, our communication interface should be lightweight: we will use one-sided communication to achieve this. For the coarser grids ( $ P
\approx N$) the communication costs dominate. Here, we will compute several coarse grids to achieve higher accuracy. This is a better alternative than just having some nodes stand idle.

We assume there is moderate fluctuation of the atomic positions and a nearly spatial alignment of the atoms. We are not interested in very large rapidly changing systems that typically occur in material science [16,96,153]. Given these assumptions, we think that the master-slave approach will scale well. To avoid bottlenecks at the master, only few small messages are sent to assign work and to notify work completion. The slave retrieves data using MPI-2 one-sided communication, where no synchronization with the master is required, avoiding considerable idle time. Furthermore, we get a very simple implementation. This concept has been successfully tested and evaluated by Thierry Matthey, a student I am co-advising [106].


   
Figure 2: (a) The master-slave paradigm, (b) The data (memory) exchange between the master and the slaves.




next up previous
Next: Collaboratory Front End Up: Protomol Software Infrastructure Previous: Protomol Software Infrastructure
Thomas Brandon Slabach
2000-07-28