Finding Semirigid Domains in Biomolecules by Clustering Pair-Distance Variations

Dynamic variations in the distances between pairs of atoms are used for clustering subdomains of biomolecules. We draw on a well-known target function for clustering and first show mathematically that the assignment of atoms to clusters has to be crisp, not fuzzy, as hitherto assumed. This reduces the computational load of clustering drastically, and we demonstrate results for several biomolecules relevant in immunoinformatics. Results are evaluated regarding the number of clusters, cluster size, cluster stability, and the evolution of clusters over time. Crisp clustering lends itself as an efficient tool to locate semirigid domains in the simulation of biomolecules. Such domains seem crucial for an optimum performance of subsequent statistical analyses, aiming at detecting minute motional patterns related to antigen recognition and signal transduction.


Introduction
Molecular dynamics (MD) can be used to investigate functional elements in biomolecules [1][2][3][4][5]. In addition to static structures (such as crystal structures stored in the protein data bank (PDB) [6]) molecular dynamics yields information on dynamic properties [7,8], lending themselves for evaluation of, for example, signal transduction. However, key patterns of motion related to such a functional element may be hidden among a large amount of "other" movements, reflecting no more than ordinary thermal motility of the biomolecule. Molecular dynamics itself can be carried out along relatively standardized protocols [9,10]. However, recognizing specific patterns of motion, which are deemed crucial for a functional element, remains a tricky task, requiring sophisticated statistical methods [11], such as principal component analysis [12,13] or normal mode analysis [14].
For all the mentioned approaches, an initial and essential step is the "fitting" of the molecular structure of each time step of an MD trajectory (henceforward called frame) to a reference structure, x ref [15]. A given frame x is first translated to let its centre of mass coincide with that of the reference frame. Then x is rotated (around its centre of mass) to minimize square deviations between corresponding atoms of x and x ref : In many approaches, RMSD has been used not only for fitting but also for directly (and successfully) quantifying molecular deformations [15], including structural changes, drifts, and trends [16]. In many cases, however, even sophisticated statistical methods fail, when applied to MD-frames after fitting. The suspicion is that the process of fitting itself might cause this failure. How does this come about? By default, the GROMACS [17] fitting procedure uses atomic masses as weights for superposition of a structure's atomic coordinates to a reference structure. Accordingly, the fitting of x "as a whole" is being optimized. In some cases, fitting the whole molecule may be inadequate and even conceal what one is searching for. For example, consider a molecule with one or more flexible loops. While the body of such a molecule behaves like a slightly deformable, rigid body, a loop may be conformationally flexible exhibiting largely uncorrelated movements with respect to the rest of the molecule. In the fitting criterion, however, atoms within the loop and those in the body may have equal weights. Since all deviations enter quadratic into (1), large movements of an even small number of loop-atoms may generate dominant contributions to the RMSD. In such a case, in order to minimize total RMSD, x is rotated predominantly to accommodate for the few atoms within the loop. As a result, the large remaining body of the molecule has to "follow its own loop, " as if the tail chases the dog [maybe this was not the primary intention of fitting]. Needless to say, due to such movements caused by fitting, that minute motile elements may become totally submerged, without any chance of being retrieved from the trajectory, not even by sophisticated statistics.
The described situation is typical and demands more elaborate fitting methods. Choosing unequal weights suggests itself as a nearby and convenient solution. The more rigid parts of the molecule should receive more weight, the flexible ones less. However, how should one know, prior to fitting, which parts are semirigid and which are flexible?
One possibility would be a two-pass procedure, in the first pass fitting to the whole molecule with uniform weights ( = 1) and evaluating the RMSF : where ⟨⟩ denotes the average over a trajectory and x ,ref denotes a reference position of atom , not changing over time. Note that RMSF will highly depend on the choice of the reference position, which is usually the mean coordinate of atom over the whole trajectory. Then, in a second pass of fitting, weights are chosen inversely proportional to the RMSF , as reported by [18]. Highly motile atoms receive less weight and lose their role in shaking the remaining main parts of the molecule. However, this method suffers from the fact that RMSF depends on the selection of x ,ref in the first pass of fitting; that is, the correction procedure depends on the error it is supposed to correct. Another possibility is the identification of semirigid domains (clusters) within the molecule, as reported by [18]. In particular, the definition of clusters may be based on distances = ‖x − x ‖ between pairs of atoms rather than coordinates computed in the trajectory. The standard deviation of distance variation (STDDV) between an atomic pair ( , ) is given by where is the number of atoms, ⟨⟩ denotes the average over a trajectory, and is measured in nm. Evidently, pairdistances are unaffected by any kind of arbitrariness due to fitting.
Given a number of clusters ( clust ), let denote the partial class membership of atom in cluster . For normalization we require The following criterion has been proposed to identify an optimum decomposition into a given number ( clust ) of clusters [18]. Minimize the target function: under the constraints of (4). Once identified, any such cluster may be used as "primary fitting domain, " by assigning large weights to the atoms therein. With little motion within such a cluster, the motility of the remaining atoms of the molecule will appear relative to that cluster. This generally increases the chance of tracing relevant patterns of motion outside the cluster, for many statistical methods being applied. The important difference from the known structureanalyzing tools of GROMACS, whether based on RMS deviation after fitting or RMS deviation of atom-pair distances, is that there is no need of a reference structure here. Also the clustering algorithms themselves, though with different criteria, assign conformations to a cluster for the molecule as a whole, while in our work, groups of atoms are assigned to the same cluster if their mutual distances vary little over time (a spatial clustering within the molecule).
MD trajectories for protein complexes were analyzed by clustering of averaged standard deviations of distance variation (STDDV); see below. Obtaining the most rigid cluster of atoms can be seen as the first step to facilitate the search for protein motions.

Construction of Complexes for Molecular Dynamics Simulation.
We applied the clustering algorithm to a series of molecular systems as follows, see Table 1.
LC13 T cell receptor (TCR) in complex with major histocompatibility complex (MHC) HLA-B * 44:05 and the ABCD3 peptide (EEYLQAFTY) has been successfully crystallized by Macdonald et al. [20] and its structure is accessible on http:// www.pdb.org/ assigned the PDB ID 3KPS. However, there are no structure files of LC13 TCR in complex with HLA-B * 44:02 and HLA-B * 44:03. Therefore, we applied homology modelling to create these structures.
In-silico mutagenesis was carried out using Swiss PDB Viewer [21].   [19]). HLA-B * 44:05 was used as a template for homology modeling, because a three-dimensional structure of this MHC in complex with ABCD3 peptide and LC13 TCR was available. Sequence alignment was done with CLC bio's CLC sequence viewer. Note that sequence numbering from PDB (PDB numbering) and IMGT/HLA database (HLA numbering) differ.

Molecular Dynamics Simulation
Protocol. The workflow of the molecular dynamics simulation of the penta-L-alanine system is closely related to that in the work of Bernhard and Noé [18]. MD simulation of penta-L-alanine was performed using GROMACS 4.0.7 [17] according to the following protocol. First, we immersed penta-L-alanine in an explicit SPC [22] artificial water bath (cubic box) allowing for a minimum distance of 1 nm between peptide and box boundaries. Second, we minimized the solvated system using a steepest descent method. Next, we warmed up the system to 293 K during a 100 ps position restraint MD simulation. Finally, we carried out the MD production run with LINCS constraint algorithm acting on bonds with hydrogen atoms using an integration step of 2 fs and the GROMOS96 53a6 force field [23]. Coordinates were written to the trajectory every 2 ps. Coulomb interactions were computed using Particle Mesh Ewald (PME) with a maximum grid spacing of 0.12 nm and interpolation order 4. Both, Van der Waals and Coulomb interactions were computed with a cut-off at 1.4 nm. Berendsen temperature coupling to 293 K and Berendsen isotropic pressure coupling to 1 bar were used. All further parameters were set in accordance with Omasits et al. [24].
MD simulation of TCR/pMHC systems was performed using GROMACS 4.0.7 [17] according to the following protocol. First, we immersed the TCR/pMHC complex in SPC [22] artificial water bath (cubic box) allowing for a minimum distance of 2 nm between complex and box boundaries. Second, we added sodium and chloride ions to a concentration of 0.15 mol/L, and at the same time neutralizing the net charge of the system. Third, we minimized the energy of the solvated system using a steepest descent method. Next, we warmed up the system to 310 K during a 100 ps position restraints MD simulation. Finally, we carried out MD production runs with LINCS constraint algorithm acting on all bonds and using the GROMOS96 53a6 force field [23]. Hydrogen motions were removed allowing for an integration step of 5 fs. Coordinates were written to the trajectory every 50 ps. Coulomb interactions were computed using Particle Mesh Ewald (PME) with a maximum grid spacing of 0.12 nm and interpolation order 4. Both, Van der Waals and Coulomb interactions were computed with a cut-off at 1.4 nm. Velocity rescale temperature coupling to 310 K and Berendsen isotropic pressure coupling to 1 bar were used. All further parameters were set in accordance with Omasits et al. [24].

Optimization of Cluster Membership.
Each atom in a molecular dynamics simulation may be uniquely assigned to one of the clust clusters considered [7,25,26], represented by a "crisp" vector of cluster-membership; for example, = [0, 0, 0, 1, 0, 0] if atom belongs to cluster 4 out of clust = 6 clusters. Alternatively, each atom may be considered to belong to several clusters simultaneously, represented by fuzzy, noninteger memberships, with normalization condition see (4). Fuzzy memberships are the more general case, it seems that they might yield lower minima of the target function than crisp memberships and should therefore be preferred. Interestingly, Bernhard and Noé [18] report that fuzzy memberships, upon optimization with a gradient method, tend to end up as crisp, that is, either 0 or 1. We have scrutinized this issue and will demonstrate how this comes about. Even more, as one of the main results of this work, we will prove that the solution has to be crisp. The proof is given via mathematical arguments; see results section. This finding allows us to restrict the search space to crisp memberships, without diminishing the generality of the optimization problem posed.

Optimization of Crisp Cluster Memberships by a Two-Stage Monte Carlo Method.
Initially, the number of clusters, clust , is chosen and the target function (5) has to be minimized. We will show that, under certain assumptions (see Section 3.1), it is sufficient to search in In a less formal formulation, the objective is to assign each of the atoms to one particular cluster (crisp memberships). It may become quite tricky to attack such a problem with an analytic gradient approach since the boundary conditions are usually difficult to handle and the search domain consists of isolated points. We have therefore chosen a two-step search, in which a random process is succeeded by an exhaustive search.
Every constellation (i.e., Monte Carlo trial for improvement) which cannot be improved by a single move of one of the atoms from one cluster to another is considered a result (minimum constellation). The result with the lowest (c) is the ground state, but all other minimum constellations should also be included in the further analysis.

Search Algorithm
(i) Start-Up. Initially, each of the components is randomly assigned to one (of the clust ) cluster.
(ii) Random Search. In the first step, each of the components (atoms) is moved from its current cluster to another randomly chosen cluster with probability . If this mutation yields a reduction in (c) the new constellation is preserved; otherwise, it is rejected. This process is repeated times. A very rudimentary benchmarking analysis has shown that = 1/ and = ⋅ clust are reasonable values to use.
(iii) Exhaustive Search. In the second step, the algorithm tries to improve (c) by single step moves for each component separately. If there is no possible move to improve (c), the constellation is necessarily a local minimum in our sense.
(iv) Ground State. Usually there is a large number of minimum constellations in the above sense. For a matrix with significant structure, the ground state will be reached after only a few trials. For ill-conditioned matrices (those without structure), a minimum constellation very close to the ground state will also be found after a few trials, although the absolute ground state might be difficult to find.

Crisp Cluster Membership as a Necessary Consequence.
We formulate and prove a lemma that crisp memberships are a necessary consequence of the topology of the multidimensional space of pair-distance standard deviations. Due to its definition, S is a symmetric, nonsingular × adjacency matrix whose entries are the standard deviations explained earlier (3); thus, > 0 for ̸ = and = 0. Let The objective is to findĉ = argmin (c) for c ∈ Ω.
To prove this lemma one uses Lagrange multipliers: Since (c) is a polynomial of order 2, the derivatives with respect to and yield a system of × ( clust + 1) linear equations of the form:  therefore, under the assumption det (S) ̸ = 0, there must be a unique solution, which is given by Unfortunately, this solution yields the maximum of (c) for c ∈ Ω. However, since Ω is convex and bounded, one knows that any argmin (c) must be on Ω. Therefore, there must be at least one = 0. Reducing the above system of linear equations by this constraint, one sees that the revolving system again has a unique solution: which again gives a maximum of (c). The derivation of the value of is rather involved and not shown.
With the same argument as before, one can proceed by setting all equal to zero with the exception of one for each . This iterative procedure shows that clustering with respect to (c) leads to a unique assignment of each of the components to one particular cluster.

Heuristic Evaluation of Solution Space around the Minima.
Our theoretical result, that memberships are crisp, can be illustrated very intuitively; see Figure 3. Left and right end of -axis correspond to constellations with minimum target function, located at the boundary, and no other minimum is found in between.

Clusters of Atomic Motions in MD Trajectory.
For above mentioned TCR/pMHC complexes B4402 and B4403, STDDV matrices were computed from the MD trajectories; see Figure 4. Only the second parts of the trajectories (corresponding to approx. 125 ns simulation time) were considered to exclude relaxation effects; see also Section 3.6.1.
STDDV matrices were clustered as described above for clust = 2 to 6. After computation, clusters are renumbered according to size (=number of atoms), the largest one always being labelled as cluster 1. For clust = 5, 6 cluster memberships for B4402 and B4403 were remapped onto the protein structure and displayed in VMD [27]; see Figure 5.
Note that NO information whatsoever about secondary structural elements, such as -helices and -sheets, has entered the clustering procedure. Still, clusters more or less seem to retrieve some of these structural elements; see Figure 5. This could be related to extensive hydrogen bonding in -helices and -sheets stabilizing these secondary protein structure elements. How could bond constraints in MD simulations influence the resulting clusters? In our calculations, we just considered the protein backbone, because amino acids side chains show larger spatial fluctuations. The backbone atoms are separated by a planar and rigid amid bond, so neighboring atoms will experience less variation in distance.

More Clusters Improve Target
Function. The number of clusters has to be preselected in our approach. If we had just one cluster, the total distance variability contained in matrix S would be part of that cluster. Increasing the number of clusters generally reduces the fraction of variability contained within clusters, expressed as percentage of total in Figure 6.

Larger Clusters Turn
Out to Be More Rigid. Clusters were constructed to achieve maximum internal "rigidity, " that is, a minimum sum of pair-distance standard deviations. One might expect that large clusters, since they accommodate many atoms within larger spatial domains, should turn out to be less rigid then smaller clusters. However, the opposite is true: larger clusters turn out to be more rigid; see the declining trend of normalized STDDV with increasing cluster size in Figure 7. This demonstrates again that structures in motility are captured via clustering. If there is no structure within matrix S, normalized cluster sizes would result nearly equal, that is, centered around 1.
Clearly, clusters do not result identical for different subsections of a MD trajectory. In Figure 7, data are shown for two trajectories and 50 subsections each, each clustered for clust = 5 and 6. For details, see legend of Figure 7. 100 Monte Carlo attempts were performed for each clustering, out of which the optimum (smallest target function) was adopted. These results confirm the general trend that larger clusters are more rigid. An overview of dispersions within and between clusters is given in Figure 8, the numerical results for 6 clusters being given in Table 2. 3.6. Stability of Clusters. Clusters have been evaluated regarding stability, in order to check whether they lend themselves as reliable semirigid domains for fitting MD-configurations. Of note that (at least) two sources of variability of cluster memberships need to be scrutinized as follows: (i) variability due to the stochastic nature of the Monte Carlo clustering method and (ii) variability due to different parts of an MD trajectory being clustered.
We will demonstrate that variability due to our Monte Carlo clustering method is negligible. As opposed to that, the "adequate" choice and preparation of the MD trajectory has tremendous impact and remains an issue of a never ending debate [28][29][30][31][32].

Variability between Different Parts of a Single MD Trajectory.
Adequate sampling of phase space is essential regarding MD-simulations [33]. Much work has been done to detect changes, drifts, and trends as markers for inadequate sampling [16,29,34]. Block averaging was proposed as one of the remedies [35]. In this work, the clustering presented above was based on matrices S computed from whole 250 ns MD trajectories. Clearly, the matrices S( 1 , 2 ) for each subset ( 1 , 2 ) of a trajectory would be different, entailing different results for clustering. The question is which is the most reliable clustering for a given molecule?
To answer this question, we shall quantify the variability of clusters for subsets, relate it to the result for the whole trajectory, and derive a "stiff kernel, " that is, those atoms which do not (or very rarely) change clusters between subsets of the trajectory.  Figure 8. Off-diagonal values relate to STDDV between clusters, corresponding to right marker in Figure 8.   Optimum solution for clust = 5 is shown in red, for clust = 6 in green (see Figure 5 for 3D visualization of clusters in the protein complexes). Vertical axis: STDDV within clusters was averaged and multiplied by clust , in order to be comparable. Horizontal axis: normalized cluster size = 1 means that the number of atoms in a cluster exactly matches the average / clust . First, we inspect the total dispersion contained within , evaluated separately for 50 subsets of 5 ns each (i.e., 100 frames out of 5000 frames in a whole trajectory); see Figure 9. In this trajectory we observe an irregular oscillating time behavior, starting with a declining tendency. The existence of such a substantial initial phase indicates that the influence of the starting configuration does not die out until after (roughly) half of the total simulation time has passed by. To account for this fact in a heuristic way, straight lines were fitted to the first and second half of the trajectory, allowing for some overlap. This accommodates with the finding of our previous work [29] that only the second half of the trajectory can be considered an unbiased sample from phase space and should be taken for further evaluations.  The total trajectory of 250 ns was split into 50 consecutive subset-trajectories and clustering performed for each of them, prescribing the number of clusters between 2 and 6 (see coloured legend). For each clustering, the resulting averaged STDDV is plotted against the vertical axis. Averaged STDDV have been multiplied by clust (as in Figure 7) to make them comparable between different numbers of clusters. Generally, more clusters entail smaller cluster size and thus reduce total motility captured in clusters. Variability of STDDV between subtrajectories illustrates the dynamical character of clustering and its dependence on the phase space sampling stage, however, with a pronounced convergence tendency of local values.  Figure 11: Migration of atoms between clusters. 50 subset-trajectories have been clustered (example shown for B4402, clust = 2). After relabelling (according to optimum permutation), the number of migrating atoms with respect to previous clustering is shown. Along this trajectory, two episodes of massive migration occur (around 100 and 220 ns, resp.). However, migrations turn out to be almost reversible; that is, most atoms finally end up in their former clusters (those of sub-trajectory 1), only about 50 (out of 826) do not.
for each of the 50 subset-trajectories (5 ns each), yielding an assignment for each atom to one of the clusters. Note that clusters were primarily labeled (cluster 1, 2, etc.) according to their size in that particular clustering (see Figure 10). Thus, the following situation may occur. Given a clustering of a subset-trajectory with first and second cluster about equal in size. Then, when clustering the following subset-trajectory, a few atoms from cluster 1 may end up in (i.e., "migrate" into) cluster 2, which may suffice to make cluster 2 now the largest cluster and therefore receiving the label "cluster 1" in the second clustering. This would yield a very peculiar result. The few "migrating atoms" would formally belong to the same cluster (in this example cluster 1), while the majority of atoms would switch between clusters 1 and 2. To avoid this misleading and undesired artifact, we refined the procedure as follows. In the first clustering, clusters were assigned labels (1, 2, etc.) according to decreasing size. After each subsequent clustering, we evaluated all permutations of cluster labels regarding the number of "migrating atoms" with respect to the first clustering. That permutation of labels, which yielded a minimum of migrating atoms, was finally adopted. As an example, the resulting number of migrating atoms is shown for B4402 and clust = 2 in Figure 11. Figure 11 the number of migrating atoms was considered. Now, we evaluate which atoms migrate. For quantification, we resort to the Kullback-Leibler-distance [36,37]:

Quantifying the Stability of Cluster Assignments. In
1/ clust represents the assumed background probability if the assignment of atom i to any of the clusters were equally probable. is the actual probability for atom i to belong to cluster m, estimated from an average over cluster memberships obtained from clustering subsets of the trajectory: Large values of KLD indicate that atom stays predominantly in the same cluster throughout the trajectory. On the contrary, values of KLD close to zero indicate a random distribution of an atom between all clusters. Figure 12 shows KLD for B4402 and clust = 5.

Clustering Reflects Structure within STDDV-Matrix.
Target function (5) only counts distance variability within the clusters, not between atoms belonging to different clusters. Thus, if we reorder atoms according to their cluster membership, clusters appear as squares along the main diagonal of the matrix S; see Figure 15. If we hypothetically assume that elements S are more or less homogenously distributed across the matrix, the "area" of each cluster in the matrix will roughly correspond to the variability within that cluster. Clearly, these squares have to be of equal size to make their joint area minimum (and thus minimize the total distance variation within clusters).
We have verified this prognosis by randomly rearranging elements of matrix S and then performing the clustering procedure. Cluster sizes resulted almost equal (for illustration see left panel of Figure 13) in each of 20 trials of random rearrangement and clustering. This finding was verified for 2 ≤ clust ≤ 6 (data not shown).
As opposed, clusters of very different sizes resulted for matrices S derived from MD simulation (illustrated in right panel of Figure 13). This clearly indicates that clusters of unequal size do not result by chance but reflect distinct dependencies within S. The sensitivity of clustering to existing structures within S is also reflected in target function (S), as can be seen by comparing matrices S from MD and their randomly rearranged counterparts; see Figure 14.
In both cases averaged STDDV decreases (improves) with increasing number of clusters. However, in presence of real dependencies between atomic mobility, clustering achieves much more improvement.

Many Small Clusters Are More Rigid but Cover Less Distance Variation.
One might ask why a larger cluster number is less favorable. Obviously, the extreme case of defining each single pair of atoms as a separate cluster would lead to minimum STDDV within clusters but would represent a trivial and useless solution. Note that STTDV within clusters decreases with increasing clust , which makes clusters more homogeneous and is a desired effect. However, at the same time the overall amount of variability caught within clusters also decreases, which is an undesired effect, since larger portions of the molecule are disregarded. These trends are reflected by the total area of coloured squares along the diagonal in Figure 13. The smaller the squares (with increasing clust ), the smaller their total area, even if there are more squares (the area decreases quadratic with the side lengths of squares).
This tendency has been quantitatively demonstrated for real MD-data in Figure 6. On top of that, Figure 14 shows that the same trend also holds for unstructured matrices, as mentioned above. However, results also show that matrices without structure (randomly rearranged elements) allow for very little reduction in the STDDV covered within clusters, as compared to structured matrices and that this difference further increases with clust .

4.3.
Clustering Is Stable and Self-Contained. Powerful statistics on MD-trajectories (which is, e.g., able to detect small motions related to signaling) needs careful fitting of configuration frames as a prerequisite. Fitting to a domain which should be as rigid and large as possible is one of the options.
Hence, finding large clusters is desirable. However, large clusters in general might prove unstable. Therefore, we have carefully investigated this issue for the target function proposed by Bernhard and Noé [18] in conjunction with our clustering method.
It turned out that larger clusters are even more stable (see Figure 7), which is a strong indicator of stability and selfcontainment of our results.

Ergodicity.
Atomic motions in a large molecule constitute a highly dimensional phase space, and MD-simulations can in most cases explore only part of the total phase space. At least, one can never be sure about the exact fraction of phase space actually visited in a specific simulation run. As a consequence we use only the second half of our MD trajectory for final clustering (see Figure 5), in accordance with our previous work [29].

Computational Resources.
Clustering takes very little iteration steps for A 5 , due to monotonous, continuous relationships between elements of S; see the appendix. Note that the rapidity of clustering in this case is not only a primary consequence of the small number of atoms but a matter of simplicity in structure.
Randomized matrices S take significantly more iterations for clustering, since many minima are almost equal regarding the target function, rendering solutions ambiguous. As opposed to this, clustering large molecules with structured internal motion, such as B4402 and B4403, yields welldefined minima after a reasonable number of trials.

Cluster Interpretation.
There is an obvious difference in visual appearance between STDDV matrices for B4402 and B4403 as seen in Figure 4. Trajectory B4402 yields an unstructured, rather flat STDDV matrix (upper panel in Figure 4), while B4403 shows a distinctly structured STDDV matrix (lower panel in Figure 4). The relation between cluster rigidity and size is illustrated in Figure 7 and shows a clear trend: cluster rigidity increases with increasing cluster size, reflected in a decreasing STDDV within clusters. However, this does not mean that the largest cluster is always the most rigid cluster (i.e., has lowest STDDV). For clust = 5, we see that in B4403 the largest cluster is at the same time the most rigid one. For clust = 5 in B4402 the second largest cluster is   Comparison between data from MD trajectory B4402 and randomly rearranged matrix elements . Black graph "B4402" relates to matrix S obtained from a real MD-simulation, whereas the red graph "B4402 randomly rearranged" stems from a matrix with randomly rearranged elements (STDDVs). the most rigid one. For clust = 6 the situation is inverted: in B4402 the largest cluster is the most rigid one. In B4403 the second largest cluster is most rigid.
All in all, the most rigid cluster was always found among the two largest clusters. The mappings of the clustering results for clust = 5 and clust = 6 for B4402 and B4403 have been displayed in Figure 5.

Conclusion and Prospects.
A main result of the paper is the finding that the target function proposed by others [18] has only crisp solutions; that is, each atom belongs to one single cluster only (and not to several clusters in a fuzzy sense). This finding allows for a much more efficient search for optimum clustering.
Based on this new finding, the process of clustering was evaluated regarding various aspects to provide concomitant information for possible application by other investigators. Applicability was demonstrated for two trajectories of 250 ns each for large biomolecular complexes whose dynamics is of key importance for the understanding of immune reactions.
Further improvements can be expected from a more detailed investigation of the Kullback-Leibler distance [36,37]. In this work it has only been reported as a means for assessing the quality of clustering by some given method. In future work, the Kullback-Leibler-distance may enter the clustering procedure itself and render clusters even more stable between subtrajectories and over time.

A. Pair-Distance Variations in a Small Molecule
For comparison, we applied our clustering method also to the A 5 penta-L-alanine peptide already analyzed by Bernhard and Noé [18]. They choose penta-L-alanine to evaluate their clustering algorithm on MD simulations for reasons of simplicity of this molecule. A 5 has four amide bonds, each comprising four atoms (CONH). The delocalization of the nitrogen's free electron between conjugated carbonyl and amine groups poses planarity and rigidity onto this structure. Mu et al. [38] showed that A 5 does not remain in an alphahelical conformation, but rather exhibits repeated folding and unfolding events. As expected for such a molecule, our clustering of the averaged STDDV matrix indicates that atoms close together show little fluctuations of their mutual distances, such as the atoms in the middle of the pentapeptide; they are, so to speak, in the centre of the storm. Conversely, atoms near the edges show large distance fluctuations with respect to all other atoms. Clustering such a homogeneous matrix S does not reveal structural elements within the molecule, but rather splits it into equal parts, according to the number of clusters preselected. Note, however, that this behavior is a property of the equally distributed matrix values S and not a general rule.

B. Software Availability
The software used for clustering into crisp domains of preselected number is currently being implemented in Java and will be made available for free download from http:// www.meduniwien.ac.at/msi/biosim/index.php?lang=en& seite=en forLehreIt pairDistanceClustering.