
Introduction

The objective of this project is to develop algorithms and high-performance software that help solve outstanding problems in computational biology through molecular simulation. The first application is computer-aided drug design, where one attempts to predict biochemical properties of drug candidates and improve their characteristics to maximize clinical potency and minimize undesirable side effects [35,40]. This will be undertaken for anti-breast-cancer drugs in collaboration with a cancer research center. The second application is protein folding and protein dynamics, where one predicts the tertiary, or globally folded, three-dimensional structure of proteins from their locally folded secondary structure. These computations might be helpful in interpreting data from the human genome project, understanding the mechanism of some diseases, designing drugs, and even growing polymers with specific properties [115]. This subproject will be undertaken for proteins of biological interest. The third application is molecular modeling, where simulations will help decide the best molecular design and configuration of a molecular quantum cellular automaton cell.

Limitations to the use of computers in the applications described above are threefold: (i) the systems simulated are large and nonlinear; (ii) simulations are computationally expensive, which sometimes renders them impractical; and (iii) there is often a communication barrier between biological scientists and computer or computational scientists. This project tries to address the technical aspects and to facilitate interaction among biological and computational scientists through several educational and collaborative initiatives. More specifically, this project attempts to construct more efficient algorithms for the core technologies underlying all of the above applications, namely molecular dynamics (MD) and hybrid Monte Carlo (HMC), as discussed in Section 2.1. It also seeks to incorporate these algorithms into kernels of high-performance problem-solving environments for clusters (Section 3) and massively parallel machines (Section 4.4), to assist specific applications in drug design, protein folding, and molecular modeling (Sections 4.1-4.3).

MD is a natural tool to explore the conformational space of molecules, particularly proteins and other biomolecules [89, p. 434]. A conformation is a set of nearby spatial configurations around some stable equilibrium point, typically believed to be an energy minimum [47]. For example, MD simulations have been used to let ligands and receptors explore their configurations in space in order to determine the efficacy of binding or docking. Study of the dynamics itself is important, since the biochemical functionality of some complex systems such as enzymes depends on dynamical processes like gating or side-chain rotations in receptor molecules [107,160]. MD is also commonly used to study thermodynamic properties of systems [5, p. 46].

In MD simulation of a classical unconstrained system, one wants to solve the N-body problem, i.e., Newton's equations of motion or some extension of these equations. This is a challenging problem not only because the number of particles is large but also because stability severely limits the length of the time steps relative to the total length needed for simulations: time steps are on the order of femtoseconds (10^-15 seconds), whereas simulations of a few microseconds (10^-6 seconds) up to one second are most desired for a process such as protein folding. The presence of multiple spatial and time scales is responsible for this problem. Most integrators have to resolve the models at very fine scales in time and space, whereas the processes of interest occur at much larger scales. Force interactions associated with the longest time step are called ``slow'' and those associated with the shortest time step are called ``fast''. This work attempts to construct multiple time stepping (MTS) integrators that are stable for long time steps. This can be accomplished with multiscale MTS integrators, in which the time step associated with the ``slow'' forces is not restricted by the ``fast'' forces. Not all MTS integrators are multiscale. In Verlet-I/r-RESPA [58,59,71,104,148,154], for example, the longest stable time step for the ``slow'' forces is limited by the presence of the ``fast'' forces; see Section 2.1. Finally, since the solution to MD is chaotic, one may expect accurate solutions for short runs only. For long simulations, however, one would like to preserve some probability distribution underlying the system. It is believed that symplectic integrators have this property [115,129,136].
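As an illustration of the impulse style of multiple time stepping described above, the following is a minimal sketch, not the code used in this project. It assumes a toy model that does not appear in the proposal: a single one-dimensional particle of unit mass attached to a stiff ``fast'' spring and a soft ``slow'' spring. The names `respa_step`, `simulate`, `K_FAST`, and `K_SLOW` are hypothetical. The slow force is applied as two half kicks with the long step, while an inner velocity Verlet loop resolves the fast force with a short step.

```python
# Toy model (an assumption for illustration): one 1D particle, unit mass,
# with a stiff "fast" spring and a soft "slow" spring.
K_FAST = 100.0   # fast spring constant, angular frequency 10
K_SLOW = 1.0     # slow spring constant


def fast_force(x):
    return -K_FAST * x


def slow_force(x):
    return -K_SLOW * x


def total_energy(x, v, mass=1.0):
    """Kinetic plus potential energy of the toy system."""
    return 0.5 * mass * v * v + 0.5 * (K_FAST + K_SLOW) * x * x


def respa_step(x, v, dt_outer, n_inner, fast, slow, mass=1.0):
    """One impulse-MTS (Verlet-I/r-RESPA style) step."""
    # Opening half kick from the slow force, using the long outer step.
    v += 0.5 * dt_outer * slow(x) / mass
    # Inner velocity Verlet loop: only the fast force, short step h.
    h = dt_outer / n_inner
    for _ in range(n_inner):
        v += 0.5 * h * fast(x) / mass
        x += h * v
        v += 0.5 * h * fast(x) / mass
    # Closing half kick from the slow force at the updated position.
    v += 0.5 * dt_outer * slow(x) / mass
    return x, v


def simulate(n_steps, dt_outer, n_inner, x=1.0, v=0.0):
    """Run n_steps outer steps and record the energy after each one."""
    energies = [total_energy(x, v)]
    for _ in range(n_steps):
        x, v = respa_step(x, v, dt_outer, n_inner, fast_force, slow_force)
        energies.append(total_energy(x, v))
    return x, v, energies
```

With `n_inner = 1` the step reduces to ordinary velocity Verlet applied to the total force. For larger `n_inner` the slow force is evaluated only once per outer step, which is where the savings come from; the outer step still cannot be chosen freely, because, as noted above, its stability is limited by the fast forces.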

The challenges of MD just described are addressed in this proposal as follows. The size of the systems will be tackled by efficient parallel implementations of all algorithms described here, including fast electrostatic force summation methods. Multiple and long time scales will be handled by constructing multiscale descriptions of the systems and by devising multiscale integrators and solvers for MD and HMC. Finally, the chaotic nature of the MD solution will be addressed by making the integrators symplectic.
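To make the role of the integrator inside HMC concrete, here is a minimal textbook-style sketch, not the multiscale HMC proposed here. It assumes a one-dimensional target with potential energy `u`, unit mass, and unit temperature; the function names and default parameters are hypothetical. Each iteration draws a fresh momentum, proposes a move with a short leapfrog (symplectic, time-reversible) trajectory, and accepts or rejects it with a Metropolis test on the change in total energy.

```python
import math
import random


def leapfrog(x, p, grad_u, step, n_steps):
    """Leapfrog trajectory for H(x, p) = u(x) + p^2 / 2 (unit mass)."""
    p -= 0.5 * step * grad_u(x)          # opening half kick
    for i in range(n_steps):
        x += step * p                    # full drift
        if i < n_steps - 1:
            p -= step * grad_u(x)        # full kick between drifts
    p -= 0.5 * step * grad_u(x)          # closing half kick
    return x, p


def hmc_sample(u, grad_u, x0, n_samples, step=0.15, n_steps=10, seed=0):
    """Draw samples from the density proportional to exp(-u(x))."""
    rng = random.Random(seed)
    x = x0
    samples = []
    accepted = 0
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)                      # resample momentum
        h_old = u(x) + 0.5 * p * p
        x_new, p_new = leapfrog(x, p, grad_u, step, n_steps)
        h_new = u(x_new) + 0.5 * p_new * p_new
        # Metropolis test: accept with probability min(1, exp(-(h_new - h_old))).
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            x = x_new
            accepted += 1
        samples.append(x)
    return samples, accepted / n_samples
```

For example, `hmc_sample(lambda x: 0.5 * x * x, lambda x: x, 0.0, 4000)` samples a standard Gaussian. The accept/reject step corrects the energy error of the integrator, so the step size affects efficiency rather than the target distribution.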

Several ideas in our work have produced encouraging results towards multiscale MTS integrators. Perturbing the potential by defining time-averaged positions, and making the force a gradient of this perturbed potential, produces a mollified or MOLLY integrator [54]. This overcomes the instability barriers of Verlet-I/r-RESPA methods and has allowed for a 50% increase in the longest possible time step. We have developed Equilibrium*, which is a multiscale integrator [74,77]. We have also successfully used domain-specific knowledge to improve MOLLY integrators, specifically integrators for hydrogen-bonded systems such as solvated biomolecules [8]. Finally, using very mild stochastic damping to stabilize MOLLY integrators has allowed for a threefold increase in the longest time step for MD [75,76]. By choosing the damping carefully we have obtained correct dynamics, as measured by self-diffusion coefficients. We call this technique Langevin stabilization. These results have been incorporated into a parallel MD code, NAMD 2.1 [81].

I propose to continue this numerical research along the following lines, all of which contribute towards enabling longer and more accurate simulations for the study of important systems: (i) develop efficient implementations of MOLLY methods for biomolecules; (ii) build multiscale MOLLY for constant temperature and pressure ensembles, which are more stable and more closely mimic experimental results, and extend the idea of Langevin stabilization to different ensembles; (iii) quantify the effect of stochasticity on the dynamics and statistical accuracy of long simulations when using Langevin stabilization, extended-system Hamiltonians, and HMC, and do nonlinear analysis of MTS methods using simple nonlinear model problems (a three-water model problem that I have developed has correctly predicted instabilities); and (iv) develop linear complexity methods for full electrostatics, which is one of the most expensive computations in MD. These fast electrostatics methods will work for both non-periodic and periodic boundary conditions and will parallelize well. Both methods will use particle/grid summation ideas, and in the periodic case wavelet transforms will be used to produce efficient Ewald-like sums. More details on all these methods are found in Section 2.1.
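One common ingredient of particle/grid methods is the step that spreads point charges onto a regular grid, after which the long-range part of the electrostatics can be handled on the grid at a cost that does not grow with the number of particle pairs. The sketch below shows only that assignment step, using linear (cloud-in-cell) weights on a periodic one-dimensional grid. It is an assumption-laden illustration rather than the scheme proposed here, and the function name and parameters are hypothetical.

```python
def assign_charges_cic(positions, charges, n_grid, box_length):
    """Spread point charges onto a periodic 1D grid with linear weights.

    Each charge is shared between its two nearest grid points in
    proportion to its distance from them, so the total charge on the
    grid equals the total charge of the particles.
    """
    grid = [0.0] * n_grid
    spacing = box_length / n_grid
    for x, q in zip(positions, charges):
        s = (x % box_length) / spacing    # position in grid units
        cell = int(s)
        frac = s - cell                   # fractional offset within the cell
        left = cell % n_grid              # wrap for periodic boundaries
        right = (left + 1) % n_grid
        grid[left] += q * (1.0 - frac)
        grid[right] += q * frac
    return grid
```

The loop touches each particle once and each particle updates two grid points, so the cost of this step is linear in the number of particles.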

Another objective of this proposal is to make the results of this research available to the scientific community in the form of open-source collaboratory software called PROTOMOL, which is being developed to meet the need for a relatively simple program that is easy to use and modify. PROTOMOL will have an intuitive scripting and web-based interface that allows users to easily prototype their own methods and to use the provided high-performance kernels in a transparent manner. A program with similar goals is NAMD 2, which has an object-oriented design and abundant documentation but was designed primarily for scalable parallel MD. This limits its usability as a platform for developing new algorithms, since the user must consider parallel issues in order to modify or add any algorithm. Other excellent MD programs exist, such as Amber [152], CHARMM [30], X-PLOR [32], PINY_MD [149], and others [16,31,41,72,106,135] described in Section 3. Few of those programs have been built with the main goal of being an extensible platform for algorithmic development. PINY_MD is one of those few, but it was not designed in a truly object-oriented language, although efforts are underway to convert its non-high-performance components to Java.

PROTOMOL has been inspired by several projects in modern scientific software engineering. Scriptable MD programs for materials science such as SPASM [17], and somewhat older programs like X-PLOR [32], have demonstrated the desirability of using scripting languages. Collaboratories for biomolecular modeling such as BioCoRe [19] attempt to facilitate remote collaboration among users of simulation and visualization software. Object-oriented frameworks such as POOMA [62], and object-oriented libraries such as OOMPAA [69], offer examples of powerful abstractions for scientific codes. Generic libraries such as the STL [132], Blitz++ [151], and MTL [134] have taught us how to write high-performance software in C++ [144]. Finally, newer parallel standards such as MPI-2 [113] make it possible to write better parallel programs than before.

PROTOMOL incorporates several of these ideas and some of its own. It is componentized, with a high-performance back end, a collaboratory-capable front end, and database and web interfaces. The back end design allows for incremental parallelization, so that at first only the performance-critical regions of the program need to be parallelized. This is important because it allows method developers to test sequential implementations of their algorithms. We intend to parallelize the nonbonded force and fast electrostatics computations, which form the bulk of the computation in MD. We will hide parallelism from the user and will use MPI-2 one-sided communication to provide a global space that simplifies our design and substantially improves performance. A student in my group has already obtained very good results with a 128-node MD code on the SGI Origin 2000 using MPI-2 [106]. The front end is scriptable, using the Simplified Wrapper and Interface Generator (SWIG [44]), so that it is not limited to only one programming language. Selection commands will allow a user to operate on subsets of biochemical constructs such as atoms, residues, chains, and segments. More importantly, this allows for extensible method development: it is particularly simple to add MTS integrators, grid/particle methods, and HMC methods for different ensembles, thanks to abstractions provided through the scripting language. Expert-system-like rules will be incorporated to simplify PROTOMOL's use: simulation parameters, structure generation (from coordinates to molecular topology), and architectural tuning will be automated as much as possible. These are the features that will enable PROTOMOL to become a problem-solving environment customized for the applications addressed by current and future users.

An important part of these problem-solving environments will be the ability to steer simulations according to experimental data or human knowledge. For example, Steered Molecular Dynamics (SMD) has been proposed as a way of explaining the mechanics of biopolymers and finding the underlying unbinding potentials (the amount of force needed to ``unfold'' a protein, for example) [61,79,84,97,124,157]. SMD, and other interactive MD techniques, might be helpful in our protein folding and anti-breast-cancer drug research collaborations.

An initial release of PROTOMOL includes many of the above features and its design has provisions for future incorporation of all other features. Currently, it provides the basic components: high performance sequential back end; Tcl/Tk scripting and text front ends; the SWIG interface; and classes to support generic boundary conditions and force fields. The parallelization of the back end, the API for interactive molecular dynamics using the visualization program VMD [70], the web interface, and the database connectivity are currently underway (see letters of support).

This project will enable multiple natural collaborations, which will enhance its scientific value and broaden its impact. These collaborations will provide the experience and feedback needed to make the program easy to use. We will use existing software, primarily NAMD 2.1 and VMD, to get our collaborators started while we develop our platform to the point where they can use it. Then we will assist them in switching to PROTOMOL to take advantage of the improved algorithms and interfaces. To help interoperability, our software will be capable of manipulating standard file formats for MD. These collaborations are briefly described here.



 
Thomas Brandon Slabach
2000-07-28