Relaxation Estimation of RMSD in Molecular Dynamics Immunosimulations

Molecular dynamics simulations have to be sufficiently long to draw reliable conclusions. However, no method exists to prove that a simulation has converged. We suggest the method of “lagged RMSD-analysis” as a tool to judge if an MD simulation has not yet run long enough. The analysis is based on RMSD values between pairs of configurations separated by variable time intervals Δt. Unless RMSD(Δt) has reached a stationary shape, the simulation has not yet converged.


Introduction
Mathematical modeling and computational biology have proved extremely successful in the efforts to understand the mechanisms of immunologic reactions, for example, in modeling the T-cell proliferation dynamics following hepatitis C, HIV infection [1], or the growth of an immunogenic tumor [2]. Methods of statistical mechanics may successfully be applied in order to quantitatively understand the immune system [3,4], and different modeling approaches are adequate for acquired immunity, as surveyed in a seminal paper by Perelson and Weisbush [5]. Modeling and simulation can be performed on different levels, starting at the top level with agent-based models [6,7] for the cooperation of numerous large biomolecules in the formation of the immune synapse [8][9][10][11] down to more detailed models of antigen binding [12], recognition, and signaling [13][14][15][16][17][18][19][20]. T-cell proliferation modulated by interleukin 2 has been simulated [21]. Applications aiming to predict the reaction to epitopes [22] improved vaccine design [23], vaccination against tumors [24], and for improved patient care have been devised [25][26][27], also in personalized medicine [28]. In particular, the mechanisms how T-cell receptors (TCRs) detect antigen peptides (p) presented by major histocompatibility complex (MHC) molecules can be investigated by molecular dynamics (MD) on a molecular or even atomic level [29][30][31][32]. However, TCRpMHC complexes are huge protein complexes, which have to be studied in water, which adds an even larger number of atoms to model and simulate. For these reasons, sufficiently long simulations are mandatory in order to obtain realistic results. This is even more so since molecular recognition phenomena may operate on rather long timescales as compared to the length of usual MD trajectories. Hence, efforts to assess sampling quality are mandatory in such simulations.
Given the trajectory of an actual MD simulation, it is clear from first principles that no formal check whatsoever can prove that complete sampling has been achieved. If parts of the phase space have not been visited yet, there is no possibility that this becomes evident from looking Black curve: RMSD between first configuration and all successive ones. Blue curve: for a given configuration (t j ), RMSD to each of the other configurations of the same trajectory was computed and averaged, yielding RMSD(t j ). That value of t j for which RMSD(t j ) is maximum is then adopted as a reference (t 1 ) to plot RMSD values of the whole trajectory. Red curve: as blue curve, but for minimum RMSD(t j ). Note that the RMSD between the reference configuration and itself is zero by definition.
only at data from those other parts that have been visited. Zhang et al. [33] proposed a formal approach to decompose the phase space into Voronoi polyhedra [34,35]. At any rate, formal checks of sampling quality can only draw on simulation data already produced and at the most detect possible shortcomings of those trajectories. The situation is analogous to tests for the randomness of pseudorandom-number generators [36]. One can never prove randomness as such. It is only possible to detect specific deviations from randomness, such as serial correlations-and any such detection works only with a preselected error rate.
Likewise, MD trajectories may be investigated for specific markers of nonrandomness, the most important type being trends of energy and molecular deformations. This work focuses on the latter, deviations in shape being quantified via the root mean square deviation (RMSD) [37] at time t 2 with respect to a given reference structure at time t 1 : where x i (t) is the position of atom i at time t and N is the total number of atoms in the molecule. Often, the first frame of a trajectory (t 1 ) is used as a reference, and values of RMSD(t 1 , t 2 ) are computed for all successive (t 2 > t 1 ) frames; see the black curve in Figure 1. RMSD monitored this way shows large rapid fluctuations on top of long-term variations and jumps. Generally, it is difficult to identify such long-term variations, which may relate to functional modes. A recent study [38] reflected the insufficiency of visual RMSD inspection alone.
Extensive work has been published on using the RMSD for the characterization of structural changes, drifts, and trends; see [33] and the references cited therein. Grossfield and Zuckerman [39] gives a seminal conceptual discussion of ergodicity, absolute and relative convergence and proposes checks for overall sampling quality. Block averaging was proposed [40] to reduce short-term fluctuations and to obtain more reliable indicators of long-term trends. However, averaging RMSD over all configurations within a block involves many different time intervals Δt = t j − t i , thus reducing the specificity of the resulting average for one particular time interval.
For this reason we consider the RMSD between pairs of configurations separated by a constant time lag Δt as described in Section 2.2.1. In this way we obtain more stable estimates (averages) which are nevertheless perfectly specific for a given time interval Δt.

Employed Structures.
We applied the proposed methodology to a total of 6 MD simulations. For this purpose we used 2 different TCRpMHC complexes: an immunogenic wild-type peptide bound between TCR/MHC and the same TCR/MHC with a less immunogenic mutant peptide. We employed the crystal structure of the LC13 TCR bound to HLA-B * 08 : 01 and the Epstein Barr Virus peptide FLRGRAYGL. This structure is available from the Protein Data Bank (PDB) [41] via PDB accession code 1mi5 [42] and is referred to as wild-type. For the mutant complex we substituted the side chain of tyrosine at position 7 of the peptide to alanine (Y7A). This was performed using SCWRL [43] since we could previously show that this tool is most appropriate for mutations in pMHC complexes [44,45].
The LC13 TCR in complex with the FLRGRAYGL peptide and HLA-B * 08 : 01 is an ideal test set for molecular dynamics simulations since this complex was crystallized and described in its parts and as a whole. Initially Kjer-Nielsen et al. crystallized the TCR [46] and the MHC [47] separately while they published a structure of the whole TCRpMHC system [42] afterwards. These available data give substantial insight into the LC13/EBV/HLA-B * 08 : 01 system and led to the choice of our wild-type and mutated system for this study. The whole system is illustrated in Figure 2.

Molecular Dynamics Simulation
Protocol. The wildtype and the mutant complex were simulated in independent runs for 10, 50, and 200 ns yielding a total of 6 simulations (see Table 1). The following protocol for the simulations was employed. First, we minimized the energy of the systems using a steepest descent method. Then we immersed the complexes in explicit SPC [48] artificial water baths allowing for a minimal distance of 2 nm between the box boundary and the protein. Next, we warmed the complex up to  Trajectory 1  10  3  FLRGRAYGL  3500  Trajectory 2  50  50  FLRGRAYGL  1000  Trajectory 3  200  50  FLRGRAYGL  4322  Trajectory 4  10  3  FLRGRAAGL  3500  Trajectory 5  50  50  FLRGRAAGL  1000  Trajectory 6 200 50 FLRGRAAGL 4322 310 K using position restraints. Finally, we carried out the simulations using GROMACS 4 [49] and the GROMOS96 force field [50]. All further parameters were set in accordance with [51].

RMSD Calculation.
After the simulations were finished, we calculated the RMSD values for each configuration in a given trajectory with respect to every other configuration of the same trajectory using the standard g rms function of GROMACS. This yields an n × n matrix of RMSD values where n is the number of configurations in the trajectory.

Averaging and Modeling of Lagged RMSD.
Given one configuration x(t i ) of an MD simulation as a reference, the RMSD to some other configuration x (t j ) may be considered a "distance measure" along the time interval Δt = t j − t i . If Δt is short enough, it may be shifted along the whole trajectory and RMSD values be sampled. The average RMSD(Δt) is characteristic for the difference between configurations separated by Δt in the particular simulation run considered. Small values of Δt characterize configurations close to each other in time. Increasing Δt means to compare configurations more distant to each other in time, which are-intuitively speaking-"less related" to each other. This fact should be reflected in RMSD(Δt) for increasing values of Δt.
As Δt increases, dependences should diminish and approach the level of "unrelated" or "independent" configurations. In order to quantify such a saturation trend, we applied the Hill equation [52]: where the parameter a reflects the maximum value to which the function is asymptotic ("plateau value" RMSD(Δt → ∞)), τ is the time lag Δt for which RMSD(Δt) = a/2 (i.e., the value of Δt for which half-saturation is achieved), and the Hill coefficient γ is a parameter that determines the shape, that is, the level of sigmoidicity, of the model functions. The parameters were estimated by fitting the Hill equation (2) to the measured values of RMSD(Δt) using the nlinfit function as implemented in the Statistics Toolbox of MATLAB (Mathworks, Natick, MA, USA). The maximum time lag Δt was chosen half the total simulation time of the respective trajectory.

Assessing the Influence of Initial
Conditions. The initial phase t ≤ t offset of each MD simulation strongly depends on the starting configuration, usually a crystal structure, which can by no means be representative for a configuration obtained from a trajectory. This is even true after energy minimization and warming up. Hence, if the initial phase of an MD trajectory is included in RMSD(Δt), a bias will result. This is not only true for individual values of RMSD(Δt) but also for the parameters estimated from the fit. In particular, the limiting plateau value a = RMSD(Δt = ∞) will also be biased and depend on t offset : in fact, a = a(t offset ). We modeled this dependence as where a 0 is the limiting value, a 0 = a(t offset = ∞), and β and λ are scaling parameters. Note that a 0 is an extrapolated estimate for RMSD(Δt = ∞) and t offset = ∞, that is, the estimate for an RMSD between two totally unrelated configurations of a trajectory, independent of initial phase effects.
Values for t offset were selected in the interval 0 ≤ t offset ≤ t max /2, where t max denotes the total simulation time of the respective trajectory.

Convergence of RMSD with Increasing Time Lag.
Each panel of Figure 3 shows a representative plot (for reasons of conciseness we display results in the figures only for trajectory 1-3) of mean RMSD values (red circles, y-axis) obtained between configurations of one MD trajectory (see figure caption and Table 1), separated by respective time lags Δt (x-axis). As the time-lag between configurations increases, mean RMSD approaches a plateau.  Table 1. Note that due to the different total durations of the simulations, the maximum time lag considered (equal to half the simulation length) varies. Vertical lines indicate the time interval τ, for which half-saturation is achieved; see also (2). The horizontal line indicates the estimated plateau of the mean RMSD, corresponding to the parameter a in (2). Fits of the model (2) to the values of RMSD(Δt) are displayed as solid lines. Parameters obtained from the fits can readily be interpreted as follows. The estimate for parameter a in (2) represents the limiting value of RMSD(Δt) and is indicated by the horizontal line. The estimate of the parameter τ corresponds to a "characteristic" ("halfsaturation") time interval Δt and is shown as a vertical line in the plots.
Note that the initial phase of each trajectory strongly determines the shape of RMSD(Δt) which should be properly represented by the fitted model. In contrast, the remainder of each trajectory is characteristic for the longterm trend of RMSD(Δt). Although it shows much smaller changes in RMSD, the remainder contains naturally by far more data points as compared to the initial phase. To achieve an appropriate balance, we increased the lag length in small steps during an initial phase (cf. the initially dense succession of red circles in Figure 3) and in larger steps (more loose succession of circles) later on. This procedure puts increased weight on the data points for the initial phase.

Influence of the Offset.
Applying different temporal offsets t offset before starting to analyze the respective trajectories changes the dependence of RMSD(Δt) as a function of the time lag Δt; see the typical results displayed in Figure 4.
The larger the offset t offset from the start configuration, the smaller the time lag Δt necessary for the system to level off to its RMSD plateau. The exemplary display of the dependence of fitted parameters on t offset , as shown previously, is systematically analyzed as follows; see parameter γ is fairly constant and almost independent of t offset .
In a next step the systematic dependence of the extrapolated plateau on t offset was fitted via (3); see Figure 6. The monoexponential decay model of (3) represents the most parsimonious choice, given the shape of the simulation results as displayed in Figure 6.
Figures 6(a)-6(c) illustrates the influence of the offset t offset from the start configuration on the respective RMSD plateau values as estimated from the parameter a of the model (2). RMSD plateau values tend to decrease with increasing t offset .

Conclusions
For molecular dynamics simulations of proteins, questions of "convergence" and sampling are important issues if sensible conclusions are to be drawn from such simulations. Since convergence of a particular simulation (in the sense that statistical sampling of the phase space is complete enough with respect to the specific phenomena studied) cannot be judged in advance [39], various techniques have been advised to demonstrate that a simulation has not yet converged [37,[53][54][55][56].
Here, we propose a method to assess if a simulation has not yet run long enough, based on RMSD analysis of successive configurations of a given MD trajectory, separated by time lags Δt of varying length (up to half the total simulation time). As long as a simulation has not yet converged, the shape of the function RMSD(Δt) still considerably depends on t offset , that is, the time point along the trajectory, where the analysis interval Δt starts. As t offset is large enough, the shape of RMSD(Δt) becomes stationary; that is, in Figure 4 the shape of the mean RMSD curve is different in the left and right panel, but further increasing t offset would not further change its overall appearance. This is also reflected in the respective model parameters, converging to constant values for large values of t offset ; see Figure 5.
To describe this Δt dependence of the mean RMSD values, RMSD(Δt), for a given t offset , the Hill function was used. This function type is frequently applied in enzyme kinetics to model saturation phenomena together with the number of reactive sites on an enzyme. In the present work, the Hill function proved flexible enough to model the functional form of RMSD(Δt) and to identify the respective parameters (plateau value, shape parameter, and half-saturation time). Contrary to a simple exponential saturation function, the Hill function is able to model different levels of sigmoidicity.
In order to quantify the influence of the initial conditions and the equilibration phase on RMSD(Δt), we systematically increased the time t offset before starting to analyze each trajectory. With the exception of the shape parameter γ all the other parameters of the Hill function describing RMSD(Δt) exhibit a distinctive dependence on the initial phase of the trajectories and thus indicate that biased estimates would result if the initial phase would be included in the analysis.
As pointed out in the introduction, the method proposed here combines the smoothing effect of averaging, while retaining the specificity of a precise time interval between configurations being compared. For example, the height of the extrapolated RMSD plateau a 0 (see Figure 6) may be interpreted as "configurational distance" between two arbitrarily selected, totally unrelated configurations of the particular part of phase space currently visited. If one chooses an arbitrary configuration out of it as the reference, all others visited by the trajectory will have an average distance a 0 . In contrast, Figure 1 shows two limiting cases of reference configurations: (i) the configuration with maximum average distance (i.e., RMSD) to all others (blue curve),  Table 1. which represents the "maximum outlier" ever seen in this trajectory and (ii) the configuration with minimum average distance (i.e., RMSD) to all others (red curve), which is "most central" within the trajectory.
Note that extrapolated plateau values are significantly larger than the RMSD(Δt) actually reached in the trajectories (see Figure 3). Only if we consider the extrapolated plateau as a function of t offset , we obtain realistic (i.e., lower) estimates (see Figure 6) to aim at during actual simulations. In this sense, we may attribute the proposed fitting procedure some forecast capability regarding the level of RMSD to be finally expected if the trajectory were carried on. This extrapolated mean RMSD corresponds to pairs of configurations separated by time intervals large enough to consider such configurations approximately uncorrelated, independent representatives of the configuration space of the respective molecule.
Likewise, the "half-saturation time" τ obtained from the fit decreases with increasing t offset . Thus, taking configurations with large enough t offset as a reference, it takes only a short time to get close to the RMSD plateau a 0 . In each panel of Figure 5, the parameter τ approaches an almost constant level at about half the maximum t offset considered (i.e., 25% of the total simulation time). This might suggest that-regardless of the total simulation length-the fraction of usable (independent) configurations remains the same (final 75% of the trajectory). As the length of the simulation run increases, the criterion derived from the run itself gets increasingly more stringent. From the 200 ns run the first 50 ns should be discarded, which is more than 5 times the  Table 1. Red circles represent plateau values a(t offset ) the error bars denote their asymptotic standard errors. The solid blue curve shows a nonlinear least-squares fit of (3) to the plateau values a(t offset ). From the latter fit the limiting plateau value a 0 = a(t offset = ∞) was extracted and is shown as a solid horizontal line together with its asymptotic standard error as obtained from the fit of (3). total length of the 10 ns trajectory. If using the latter as a basis of estimate, however, the last 7.5 ns seem to be trustable.
Although the analysis reported in the present work is specific to TCRpMHC complexes, we expect the method of lagged RMSD analysis to be applicable to similar molecular systems, such as membrane proteins comparable in size and structure. The approach presented here is designed to assess the degree of convergence of MD simulations and hence the statistical quality of conclusions drawn from such simulations.