A combination of computational models and simulations is being used to evaluate proposed petaflop hardware and system configurations, as well as MD data decompositions. Although parallel implementations of MD simulations exist, none has been attempted at the scale we are proposing, and more accurate algorithmic complexity models are needed to aid such attempts.
A computational model of algorithmic scalability is being developed to allow us to determine the scalability and system parameters needed to achieve a petaflop. Computational models and simulations are used frequently in the development and analysis of architectures, both to determine design parameters and to analyze performance [4]. This model must take into account a number of characteristics of the system's execution, such as the multi-threaded processors, communication costs, and memory access latency. Currently we have an initial model that addresses the number of threads needed per processing node to mask memory access latencies. Ideally we want to mask the remote access latency, since it is the longest. To compute the number of threads needed to do this, we divide the latency of a remote access by the number of operations per thread.
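The thread-count estimate described above can be illustrated with a minimal sketch. The function name, parameter names, and the numeric values in the usage lines are hypothetical and chosen only for illustration; they are not measurements from our simulations. The sketch assumes latency is expressed in the same units as a single operation (for example, cycles), and it rounds the ratio up so that latency is fully covered.

```python
def threads_to_mask_latency(remote_latency_cycles: int, ops_per_thread: int) -> int:
    """Estimate threads per node needed to hide a remote access.

    Implements the ratio stated in the text: remote access latency
    divided by the number of operations per thread, rounded up.
    Both arguments are assumed to be in the same unit (e.g. cycles,
    with one operation per cycle); that unit choice is an assumption.
    """
    if remote_latency_cycles < 0:
        raise ValueError("latency must be non-negative")
    if ops_per_thread <= 0:
        raise ValueError("ops_per_thread must be positive")
    # Integer ceiling division avoids floating-point rounding issues.
    needed = -(-remote_latency_cycles // ops_per_thread)
    # At least one thread is always required to do any work at all.
    return max(1, needed)


# Illustrative values only (hypothetical, not taken from the SHADE runs).
example_latency = 200   # cycles for one remote access
example_ops = 25        # operations each thread runs per access
example_threads = threads_to_mask_latency(example_latency, example_ops)
print(example_threads)
```

A more refined model would also account for context-switch overhead and for the communication costs mentioned above; this sketch deliberately omits them.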
The principal benchmarking tool throughout the project is the SHADE suite developed by Sun Microsystems. The simulation code used for our initial numbers is SAMD 2, an early serial version of PROTOMOL [155]. This code was broken down into individual threads to be mapped onto the multi-threaded SHADE simulator. From these numbers, together with a communication cost model, we have computed that 16 threads are needed to mask the average memory latency in this application.
Work is already underway to extend the current computational model to consider thread movement and communication issues, and to use a hybrid simulation. This hybrid model will help to further elucidate the system configuration and will provide a means to test various software models. Additionally, there are more efficient algorithms to be explored, such as fast electrostatics, as well as different decompositions to evaluate in order to determine the best fit for our system. It would also be useful to test various decompositions on current MPPs to gather information that will help us map them to larger systems. In particular, Guang Gao of the University of Delaware has two compilers that create executable multithreaded code representative of that for one of our petaflop systems. The ability to use these compilers on PROTOMOL will allow us to gather performance numbers important both to the design of our system and to the scalable mapping of MD code to the system. An additional source of data would come from testing on the Tera machine, and possibly on some Blue Gene prototypes.
PROJECT DESCRIPTION: II. EDUCATIONAL AND CAREER PLAN