*BMC Bioinformatics*, v.10; 2009. PMCID: PMC2639377

# Using least median of squares for structural superposition of flexible proteins

^{1}School of Mechanical Engineering, Purdue University, West Lafayette, IN 47907, USA

^{2}School of Electrical and Computer Engineering (by courtesy), Purdue University, West Lafayette, IN 47907, USA

Corresponding author.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## Abstract

### Background

Conventional superposition methods use an ordinary least squares (LS) fit for structural comparison of two different conformations of the same protein. The main problem of the LS fit is that it is sensitive to outliers, i.e. large displacements between the structures being superimposed.

### Results

To overcome this problem, we present a new algorithm that superimposes two protein conformations by their atomic coordinates using a robust statistics technique: least median of squares (LMS). The forward search technique is used to approximate the LMS optimization efficiently. Our algorithm can automatically detect and superimpose the rigid core regions of two conformations with small or large displacements. In contrast, most existing superposition techniques depend strongly on an initial LS estimate for the entire atom sets of the proteins, and they may fail on structural superposition of two conformations with large displacements. The presented LMS fit can be considered an alternative and complementary tool for structural superposition.

### Conclusion

The proposed algorithm is robust and does not require any prior knowledge of the flexible regions. Furthermore, we show that the LMS fit can be extended to multiple level superposition between two conformations with several rigid domains. Our fit tool has produced successful superpositions when applied to proteins for which two conformations are known. The binary executable program for the Windows platform, tested examples, and database are available from https://engineering.purdue.edu/PRECISE/LMSfit.

## Background

Protein flexibility is of great interest due to its essential role in various biological processes. The flexibility of dynamic regions allows a protein to assume multiple conformational states. Protein conformational changes play a critical role in biological functions such as ligand-protein and protein-protein interactions [1-5]. The rigid regions of the protein, which have high structural stability, remain relatively unchanged between the multiple conformations in spite of any movement of the flexible regions [2-4]. To understand this kind of biological process, the first step is to find out which regions stay the same and which change between two or more conformations. Structural superposition, defined as laying one molecule over the other by appropriate rotation and translation, is a common way to achieve that goal [2,6-8].

Superposition of molecular structures is an essential tool in structural bioinformatics and is used routinely in the fields of NMR, X-ray crystallography, protein folding, molecular dynamics, rational drug design and structural evolution [2,6-8]. Conventional superposition methods treat proteins as rigid bodies and use an ordinary *least squares* (LS) fit, in which the optimal rotations and translations are found by minimizing the root mean square deviation (RMSD) [9-13] between equivalent atom pairs. The LS fit for structural superposition of proteins is also called the RMSD fit. However, proteins are flexible molecules that undergo significant structural changes as part of their normal function. When flexible molecules in different conformations are fitted to each other as rigid bodies, even strong structural similarity can be missed [14]. One main problem of the conventional LS fit is that it is sensitive to local displacements [2,8,15,16]. In addition, most existing improvements of superposition depend strongly on an initial LS estimate for the entire atom sets of the proteins [2,7,8,16-20], and may therefore fail on structural superposition of two conformations with large displacements. To correct these shortcomings of the conventional LS fit, we introduce a new fit algorithm based on robust statistics techniques that will be explained later. Our method handles the superposition of two conformations with small or large displacements without any prior knowledge of the flexible regions.

The heart of comparing two conformations of a protein is an appropriate overlay of the structures for visual inspection, where each protein is typically represented by the virtual chain of *C*_{α} atoms of its residues [2,13,21]. A relatively large number of protein structural comparison algorithms have been presented. They can be roughly categorized into two classes [22]: *structural superposition* and *structural alignment*.

In structural superposition problems, protein structures are compared with an a priori specified equivalence between pairs of residues (such an equivalence can be provided by sequence or threading algorithms, for example) [2,8,22]. The most commonly used superposition algorithm is the LS fit, and the RMSD fit is a widely used way to calculate the LS solution and to evaluate the quality of a superposition [8]. The standard matrix-form algorithm for the RMSD fit was described by Kabsch [9-12]. This algorithm is the basis of most structural comparison methods that overlay molecules. Like most RMSD fitting procedures, this paper only superimposes the *C*_{α} atoms, i.e. residues. Consider two proteins composed of *N* atoms each, whose Cartesian coordinates are represented by the ordered point sets {**x**_{1}, ..., **x**_{N}} and {**y**_{1}, ..., **y**_{N}}, respectively. We assume that the centers of mass of both proteins are at the origin (it is trivial to translate any set of protein coordinates to accomplish this). The RMSD fit problem is then to find an orthogonal 3 × 3 matrix **U** that minimizes the following residual function:

$${D}_{rmsd}^{2}=\frac{1}{N}\sum _{i=1}^{N}{\Vert \mathbf{U}{\mathbf{x}}_{i}-{\mathbf{y}}_{i}\Vert }^{2}$$

When ${D}_{rmsd}^{2}$ is a minimum, the square root of its value (i.e. *D*_{rmsd}) becomes the minimal RMSD distance between the two point sets. An alternative way to represent the two point sets uses two 3 × *N* matrices **X** and **Y**, where the *i*th column of **X** is the vector **x**_{i}, and similarly for **Y**. The RMSD optimization consists of four steps [2,21,23]:

1. Compute the covariance matrix **R** = **XY**^{T}.

2. Calculate the SVD (Singular Value Decomposition) **R** = **VSW**^{T}, where **V** and **W** are the matrices of left and right singular vectors, respectively, and **S** is the positive semidefinite diagonal matrix of singular values of **R**.

3. Compute *χ* = sign(det(**R**)) = ± 1.

4. Calculate the rotation matrix **U** as

$$\mathbf{U}=\mathbf{W}\left(\begin{array}{ccc}1& 0& 0\\ 0& 1& 0\\ 0& 0& \chi \end{array}\right){\mathbf{V}}^{T}$$
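The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' C++ implementation: it assumes the convention that **U** rotates each **x**_{i} onto **y**_{i}, that both 3 × *N* coordinate matrices are already centered at the origin, and the function name `rmsd_fit` is chosen here for illustration only.

```python
import numpy as np


def rmsd_fit(X, Y):
    """Least-squares rotation of the centered 3xN point set X onto Y.

    Returns (U, rmsd), where U is the proper rotation that minimizes
    (1/N) * sum_i ||U x_i - y_i||^2 and rmsd is the square root of that minimum.
    """
    R = X @ Y.T                                      # step 1: covariance matrix
    V, S, Wt = np.linalg.svd(R)                      # step 2: R = V diag(S) W^T
    chi = 1.0 if np.linalg.det(R) >= 0 else -1.0     # step 3: sign(det(R))
    U = Wt.T @ np.diag([1.0, 1.0, chi]) @ V.T        # step 4: U = W diag(1,1,chi) V^T
    diff = U @ X - Y
    rmsd = float(np.sqrt((diff ** 2).sum() / X.shape[1]))
    return U, rmsd
```

The factor *χ* guards against an improper rotation (a reflection), which the SVD alone could otherwise return when the two point sets are close to mirror images of each other.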

An alternative RMSD fit approach uses a compact representation of rotational transformations called quaternions [9,10]. To make the RMSD effectively independent of the number of atoms, Maiorov et al. [13] proposed a normalization scheme. In addition, Wallin et al. [24] investigated and compared the properties of multiple distance measurements related to RMSD. More recently, Theobald et al. [8,16] applied the principle of maximum likelihood to the superposition problem by assuming a Gaussian distribution of the whole structures in the analysis. Additionally, algorithms based on multidimensional rotations and modified quaternions have been developed for structural superposition [25]. However, most existing improvements of superposition are based on the standard LS optimization. To overcome the disadvantages of the standard RMSD fit, some improved algorithms, such as sieve-fit [19], fit-all [18], and HingeFind [20], perform an iterative least-squares superposition that eliminates atoms lying far apart in the superposition. However, these algorithms depend on the initial RMSD fit for the entire atom sets of the proteins, which may fail on structural superposition of two conformations with large displacements. Damm and Carlson [2] recently developed a Gaussian-weighted RMSD (wRMSD) fit, which uses a weight function to bound the influence of atoms through an iterative LS fit. To overcome the effect of the initial RMSD fit, Damm and Carlson suggested large scaling factors for the global wRMSD fit code, and they also recommended the local wRMSD fit for proteins with extreme structural changes. The wRMSD fit can achieve good results. In addition, several authors have reported techniques for multiple structural superposition [8,16,26-29], where a simultaneous superposition can be employed to avoid biasing the superposition towards a specific (pivot) structure. We limit the study presented in this paper to pairwise structural superposition in terms of fitting atomic coordinates of two conformations of the same protein.

Unlike structural superposition, structural alignment aims to compare a pair of structures for which the alignment between equivalent residues is not given a priori. Therefore, an optimal sequence alignment needs to be identified, a problem that has been shown to be NP-complete [30]. Many structural alignment methods, such as DALI [31] and CE [32], have been proposed to identify the defined best alignment. The general outputs of structural alignment are a superposition of equivalent atomic pairs and a minimal RMSD distance fitted between two structures. Recently, some methods have considered the hinge regions for aligning the rigid subparts of the molecules [14,33]. Structural alignment is often composed of three steps: finding the atom pair correspondence (alignment), superposition, and the RMSD calculation. Many structural alignment programs compute the correspondence and the superposition simultaneously. Several papers [8,22] have clearly distinguished between structural superposition and alignment. Although several recent works [25,34] are also named "*superposition*", they are actually related to structural alignment and deal with different topics from our work. A review of the many available methods for structural alignment is beyond the scope of this paper. The reader may consult Refs. [14,22,27,32,33] for detailed expositions.

The RMSD fit can be regarded as an LS fit [2,8] that finds a best rotation to fit a given atomic arrangement to approximately measured coordinates. A statistical fit is considered to be *robust* if it has a large *breakdown point*. A breakdown point may be loosely defined as the smallest percentage of outliers that can cause the estimator to take arbitrarily large aberrant values [35,36]. For instance, the breakdown point of the median of a set of values is 50% [36], whereas LS has a breakdown point of 0%. In this paper, we treat the displacements between two conformations of the same protein as *outliers*, i.e. location errors, during the fit process.
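The difference in breakdown points can be seen on a toy example. The sketch below uses made-up residual values (in Å) and a hypothetical helper `shift_under_outlier`; it only illustrates that a single gross outlier moves the mean, on which LS is based, arbitrarily far, while the median stays put.

```python
from statistics import mean, median


def shift_under_outlier(values, outlier):
    """Replace the last value by `outlier` and report how far the mean
    and the median move (absolute change of each estimate)."""
    contaminated = list(values[:-1]) + [outlier]
    return (abs(mean(contaminated) - mean(values)),
            abs(median(contaminated) - median(values)))


# Hypothetical residuals of a well-fitted rigid core, in angstroms.
core_residuals = [0.4, 0.5, 0.6, 0.5, 0.4, 0.6, 0.5]
mean_shift, median_shift = shift_under_outlier(core_residuals, 50.0)
print(f"mean moved by {mean_shift:.2f}, median moved by {median_shift:.2f}")
```

Making the outlier larger increases the shift of the mean without bound, whereas the median is unaffected as long as fewer than half of the values are contaminated.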

Several robust statistics methods have been applied to structural superposition of proteins [2,8]. For instance, Lesk presented the sieve-fit procedure [19], which eliminates atoms that lie far apart in the initial fit. The algorithm is realized through an iterative LS procedure as follows [37]. If the calculated RMSD between two point sets is larger than a threshold, the distances between the corresponding atoms in the sets are calculated. The atoms furthest apart are then removed from the original sets and the remaining atoms are superimposed again. This procedure is iterated, with one pair of atoms being eliminated in each iteration, until the calculated RMSD is less than the threshold. Lesk's sieve-fit procedure [19] is unsuitable for superposition between two conformations with multiple rigid domains. HingeFind, presented by Wriggers et al. [20,38], modified the sieve-fit routine so that new atoms within a tolerance distance are included, in addition to the elimination of far-apart atoms. Gerstein et al. [18] proposed the fit-all algorithm to classify the mechanism of domain rotation as hinge-like or shear-like. MolMovDB [17] used a modification of sieve-fit that stops the procedure according to the domain size. The above algorithms can be regarded as *backward methods* in the statistical sense. The strategy of backward methods for fitting two point sets is to first fit all the points and then try to remove bad points or weaken their influence [35]. Unfortunately, as is well known in the statistics literature [35,39], errors and outliers can influence the fitted model in backward methods. The backward algorithms depend on the initial fit for the entire atom sets of the proteins, which may fail on structural superposition of two conformations with large displacements.
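The backward strategy of sieve-fit, as described in the paragraph above, can be sketched as follows. This is an illustrative NumPy reading of that description, not the Gerstein Lab code; the helper names `superpose` and `sieve_fit`, the convention of rotating **X** onto **Y**, and the safeguard that stops at three remaining pairs are assumptions made here.

```python
import numpy as np


def superpose(X, Y):
    """LS fit of the 3xN set X onto Y; returns rotation U and both centroids."""
    cx = X.mean(axis=1, keepdims=True)
    cy = Y.mean(axis=1, keepdims=True)
    R = (X - cx) @ (Y - cy).T
    V, _, Wt = np.linalg.svd(R)
    chi = 1.0 if np.linalg.det(R) >= 0 else -1.0
    return Wt.T @ np.diag([1.0, 1.0, chi]) @ V.T, cx, cy


def sieve_fit(X, Y, threshold=0.5):
    """Backward elimination: fit all pairs, then drop the worst pair and refit
    until the RMSD of the kept pairs is at or below `threshold`."""
    keep = np.arange(X.shape[1])
    while True:
        U, cx, cy = superpose(X[:, keep], Y[:, keep])
        dist = np.linalg.norm(U @ (X[:, keep] - cx) + cy - Y[:, keep], axis=0)
        rmsd = np.sqrt((dist ** 2).mean())
        if threshold >= rmsd or keep.size == 3:   # assumed safeguard: keep at least 3 pairs
            return U, cx, cy, keep
        keep = np.delete(keep, np.argmax(dist))   # remove the pair that is furthest apart
```

Because the first fit uses every atom pair, a large moving domain can distort it so much that core atoms, rather than the displaced ones, end up with the largest distances; this is the dependence on the initial fit discussed above.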

Damm and Carlson [2] used the wRMSD fit for superimposing two protein conformations in order to overcome the disadvantages of the LS fit. They also recognized that their method may yield poor results when the procedure starts with all the residue pairs of two significantly different structures (such as structures in which the relative positions of two domains have shifted). Therefore, they presented the local wRMSD fit, which uses an alternative starting procedure in the spirit of forward regression. The main difference between our work and wRMSD lies in the optimization equations used for the fit.

Recently, Fleishman et al. [35] introduced a robust moving least squares technique for fitting a piecewise smooth surface to a point set. The main tool that they use is a robust statistics method for outlier detection, the *forward search* algorithm, which has a significant advantage in detecting outliers over commonly used backward methods. Unlike most existing backward methods, which depend on an initial estimate for the entire point set, the forward search starts from a small set of robustly chosen samples of the data that excludes outliers. The forward method then moves through the data, adding observations to the subset while monitoring certain statistical estimates. Our work presented in this paper is in the same spirit and applies the forward search to structural superposition of flexible proteins. The main difference between our algorithm and the existing superposition methods is that "least squares" is replaced by "least median of squares", combined with the forward search, so that the improved superposition algorithm is more robust to large displacements. Our method can be considered an alternative tool for structural superposition that complements other tools such as sieve-fit, fit-all [18-20], and the wRMSD fit [2].

## Results

We have implemented the technique presented in the previous section and tested it on a number of proteins with known conformational changes. The algorithm is implemented in C++. In this paper, execution times are given in seconds on a Pentium IV 1.70 GHz processor with 512 MB of RAM, excluding the time needed to load the proteins. For simplicity, our code and our examples in this article use only two conformations of each protein, but the algorithm could be applied in any program that iteratively superimposes ensembles of structures.

Fig. 1 shows the procedure of the LMS fit between two conformations based on the forward search algorithm. First, an initial subset (two point pairs in Fig. 1(a)) is selected using the LMS algorithm. Next, we iteratively add the one pair of points with the smallest residual and refit the two conformations to the updated subset using the standard RMSD fit. The subset at the 10th iteration is shown in Fig. 1(b). If the error is larger than a predefined threshold, the iteration procedure is terminated. The final subset is shown in Fig. 1(c). The remaining points are regarded as outliers and are not used for computation of the final fit. The superposition results using the LMS and RMSD fit are given in Fig. 1(d) and Fig. 1(e), respectively. Arrows denote regions with improved fit.
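The procedure just described can be sketched as follows. This is a simplified illustration and not the authors' C++ program: the LMS start is approximated here by random sampling of small subsets, and the sample size `m`, the number of `trials`, the reading of "error" as the smallest residual among the pairs not yet in the subset, and all function names are assumptions made for this sketch.

```python
import numpy as np


def superpose(X, Y):
    """LS fit of the 3xN set X onto Y; returns rotation U and both centroids."""
    cx = X.mean(axis=1, keepdims=True)
    cy = Y.mean(axis=1, keepdims=True)
    R = (X - cx) @ (Y - cy).T
    V, _, Wt = np.linalg.svd(R)
    chi = 1.0 if np.linalg.det(R) >= 0 else -1.0
    return Wt.T @ np.diag([1.0, 1.0, chi]) @ V.T, cx, cy


def residuals(X, Y, fit):
    """Distance of every atom pair after applying the fitted transform to X."""
    U, cx, cy = fit
    return np.linalg.norm(U @ (X - cx) + cy - Y, axis=0)


def lms_fit(X, Y, r_max=2.0, min_iters=None, trials=200, m=4, seed=0):
    n = X.shape[1]
    min_iters = n // 2 if min_iters is None else min_iters
    rng = np.random.default_rng(seed)

    # LMS start: keep the random m-pair sample whose fit gives the
    # smallest median of squared residuals over all atom pairs.
    best_med, best_idx = None, None
    for _ in range(trials):
        idx = rng.choice(n, size=m, replace=False)
        med = np.median(residuals(X, Y, superpose(X[:, idx], Y[:, idx])) ** 2)
        if best_med is None or best_med > med:
            best_med, best_idx = med, idx
    subset = [int(i) for i in best_idx]

    # Forward search: add the pair with the smallest residual, then refit.
    iters = 0
    while len(subset) != n:
        r = residuals(X, Y, superpose(X[:, subset], Y[:, subset]))
        r[subset] = np.inf                 # only pairs outside the subset compete
        j = int(np.argmin(r))
        if r[j] > r_max and iters >= min_iters:
            break                          # remaining pairs are treated as outliers
        subset.append(j)
        iters += 1
    return superpose(X[:, subset], Y[:, subset]), sorted(subset)
```

The final transform is computed from the core subset only, so the pairs left outside it have no influence on the superposition.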

**The illustration of the LMS fit between two conformations (RAN): 1byu (light gray) and 1rrp (dark gray)**. First, an initial subset is selected using the LMS algorithm, as shown in (a). Next, we iteratively add one pair of points with the smallest residual

**...**

### Protein data set

We have chosen to test our method on protein systems found in the Database of Macromolecular Movements (MolMovDB) [40]. MolMovDB presents a diverse set of proteins and other macromolecules that display large conformational changes, and can be found at http://www.molmovdb.org/. The corresponding experimental structures are downloaded from the Protein Data Bank (PDB) [41], and the first chain of each structure is used as the reference structure for superposition. PyMOL is used for visualization and for the creation of the figures in this article [42].

Our code currently implements our method using the *C*_{α} coordinates of two protein conformations (it is straightforward to use all backbone atoms). Our preprocessing removes any inappropriate residues, including duplicate residues, disordered residues, and heterogroups, from the respective PDB files. We first apply the LMS fit to several protein systems in MolMovDB. Table 1 lists the names of the test systems and gives the superposition results for each protein system in the final LMS fit, where "Protein system" is the name of the test system, "PDB1" and "PDB2" are the PDB codes of the two conformations fitted, "RMSD" is the standard RMSD distance for the entire atom sets using the RMSD fit, "#Res" is the number of atom pairs after removing the inappropriate residues, "#Subset" is the number of atom pairs in the final subset, "Core%" is the proportion of the original point set that belongs to the core region (i.e. the final subset; see Eq. (4)), and "Time(s)" is the time needed to compute the LMS fit. The proteins are chosen based on their interest to the community, variation in size, and range of conformational changes. When the structures of two conformations are very similar (e.g. RAN and ER*α*), "Core%" is usually high. In contrast, the lower the similarity, the smaller the value of "Core%" (e.g. Calmodulin and Myosin). The presented algorithm is also fast. For instance, it performs a structural superposition for a pair of conformations with 700 amino acids in about half a second.

The superposition procedure first requires a list of corresponding atom pairs, and then performs an LMS fit to bring the two proteins into proximity. Note that the LMS fit is not a tool for structure-based sequence alignment, which is a separate bioinformatics challenge [8,43]. Thus, like other structural superposition methods [2,8], the LMS fit requires an a priori one-to-one mapping among the atoms/residues in the structures under consideration. Our method can be applied to two homologous structures with different residues by using an initial sequence or structural comparison to create the corresponding atom pairs.

### Parameters

The LMS fit algorithm presented in this paper involves two parameters: the maximal residual *r*_{max} (default is 2Å) and the minimal iteration number MIN_ITERS (default is [*N*/2.0]). Here, MIN_ITERS is usually chosen as a predefined integer to ensure that the number of atoms in the core regions is more than 50% of all atoms. In this section, we start by investigating the effect of the maximal residual *r*_{max}. The threshold *r*_{max} controls the final subsets fitted. In order to investigate only the effect of *r*_{max}, we first ignore the other termination condition, namely that the iteration number should be larger than the minimal iteration number MIN_ITERS.

#### The maximal residual

Fig. 2 shows the value of Core% with respect to various *r*_{max} for four protein systems: ER*α*, RAN, Myosin and Calmodulin, which are listed in Table 1. We vary the threshold of the maximal residual, using *r*_{max} from 0 to 14Å, to determine its effect on the LMS fit. The value of Core% increases with *r*_{max} until 100% is reached, i.e. until all atom pairs are included. The reason is that the LMS fit adds the atom pair with the minimal residual to the current subset at each iteration until all atom pairs are exhausted. When the structures are very similar, a small *r*_{max} can obtain a "tighter fit" of the rigid core with a high value of Core%. For instance, *r*_{max} = 1.0Å gives a value of Core% close to 80% for the ER*α* structure. In contrast, when the structures are dissimilar over large regions, a large *r*_{max} is required. Note that even an *r*_{max} of more than 4.0Å only gives about Core% = 50% for the Calmodulin structure. Therefore, a fixed *r*_{max} alone is not sufficient to superimpose all protein systems, with both high and low similarity. To overcome this problem, we suggest that the maximal residual *r*_{max} and the minimal iteration number MIN_ITERS be combined to control the termination conditions. For protein systems with high similarity, *r*_{max} = 2.0Å is usually enough to obtain an appropriate subset. When *r*_{max} = 2.0Å is not sufficient for protein systems with low similarity, MIN_ITERS ensures that the fitted subset contains more than 50% of all atoms. We found that the combination of *r*_{max} and MIN_ITERS with their defaults leads to fast convergence and little computation time for most protein systems in MolMovDB. In all results shown in this paper, we use *r*_{max} = 2Å and MIN_ITERS = [*N*/2.0] to obtain both small errors and little computation time.
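The combined termination rule and the Core% measure can be written down compactly. The sketch below is an interpretation under stated assumptions: Core% is taken to be #Subset divided by #Res (Eq. (4) itself is not reproduced in this section), [*N*/2.0] is read as the integer part of *N*/2, and the function names are illustrative.

```python
def core_percent(n_subset, n_res):
    """Share of atom pairs that ended up in the final subset, in percent."""
    return 100.0 * n_subset / n_res


def should_stop(next_residual, iteration, n_res, r_max=2.0, min_iters=None):
    """Stop the forward search only when the best remaining pair exceeds
    r_max AND at least min_iters iterations have been carried out."""
    if min_iters is None:
        min_iters = n_res // 2          # assumed reading of [N/2.0]
    return next_residual > r_max and iteration >= min_iters


# ER-alpha example from the text: 203 core pairs out of 244 atom pairs.
print(round(core_percent(203, 244), 1))
```

With these defaults, a dissimilar pair such as Calmodulin (138 atom pairs) keeps growing its subset past residuals of 2Å until 69 iterations have been completed, which is what keeps Core% near 50% for that system.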

### Comparison of results

In this section, we first compare the visualization results of structural superposition for some conformations. Then we present a strategy, called *residual histogram*, for quantifying the superpositions.

#### Visualization comparison of superposition

In this section, we compare the performance of our algorithm with three superposition techniques: the RMSD fit, sieve-fit, and the wRMSD fit [2]. The sieve-fit source code can be found on the Gerstein Lab website http://faqs.gersteinlab.org/search?q=sieve; we use its default parameters (the maximal iteration number is 500 and the distance threshold is 0.5). The wRMSD source code is available on the Carlson Lab website http://sitemaker.umich.edu/carlsonlab/resources.html. The Gaussian weight of wRMSD is computed by ${w}_{n}={e}^{-{({d}_{n})}^{2}/c}$, where *c* is a scaling factor and *d*_{n} is the distance between atom *n* in the two protein conformations. In the old version of the wRMSD fit, *c* is set to 2Å for similar structures and to 5Å for non-similar structures. For structures with radical changes, the scaling factor may be as high as the initial RMSD between the structures. Two programs (the global and the local wRMSD fit) are available. The local wRMSD fit is the recommended algorithm for proteins with extreme structural changes. Recently, Damm and Carlson updated the global wRMSD code so that it sets the scaling factor to the standard RMSD value. The wRMSD fit can produce good structural superpositions of two conformations with small and large displacements. The LMS and wRMSD fits achieve similar results.
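To give a feel for how the Gaussian weight bounds the influence of displaced atoms, the weight formula above can be evaluated for a few distances. The distances below are made up for illustration, and `gaussian_weights` is a hypothetical helper, not part of the Carlson Lab code.

```python
import math


def gaussian_weights(distances, c):
    """w_n = exp(-(d_n)^2 / c) for each inter-conformation distance d_n."""
    return [math.exp(-(d ** 2) / c) for d in distances]


# Hypothetical distances (angstroms) between equivalent atoms.
distances = [0.0, 0.5, 1.0, 2.0, 5.0]
for c in (2.0, 5.0):
    weights = gaussian_weights(distances, c)
    print(c, [round(w, 3) for w in weights])
```

An atom that has not moved keeps a weight of 1, while the weight of an atom displaced by several ångströms is close to zero; a larger scaling factor *c* lets more distant atoms retain some influence.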

**Example 1**. The ER*α* structures (3erd and 3ert), which differ by some small structural changes, are tested using the RMSD, LMS and wRMSD fit. Fig. 3 shows the results of superposition for ER*α* using the three methods. In the final RMSD fit (Fig. 3(a)), only 39 of the 244 atom pairs common to both structures are within 1Å. In contrast, the final LMS fit (Fig. 3(b)) has 188 atom pairs within 1Å, and the RMSD distance between the two core regions (203 atom pairs) is 0.49Å. The final wRMSD fit (Fig. 3(c)) also has 188 atom pairs within 1Å. In Figs. 3(b) and 3(c), we observe that the fit results of LMS and wRMSD are very similar. When the change between two conformations is slight, the result of superposition using the LMS fit is approximately equal to that of the wRMSD fit [2]. Both LMS and wRMSD highlight the similarity of the rigid core regions better than RMSD.

**Superposition comparison for the ER*α* structures: 3erd (light gray) and 3ert (dark gray)**. (a) The RMSD superposition. (b) The LMS superposition, where the maximal residual *r*_{max} of 2Å is used in our method. (c) The wRMSD superposition. For

**...**

**Example 2**. The Topo II structures (1bgw and 1bjt), which differ by some large structural changes, are tested using four methods. Fig. 4 shows the results of superposition for Topo II. Different crystal forms exhibit significant changes in the overall architecture of Topo II, including an extremely large (170 degrees) domain rotation [44]. The changes between the two conformations are so large that the standard RMSD fit misses the structural similarity, as shown in Fig. 4(a). The final superpositions using the standard RMSD fit and the sieve-fit have 26 and 18 atom pairs within 2Å, respectively. The final LMS fit has 381 atom pairs within 2Å, and the RMSD distance between the two core regions (389 atom pairs) is 0.85Å. Arrows in Fig. 4(d) highlight the improvement in fitting the rigid core of Topo II. The LMS fit captures the structural similarity, and our result is similar to that of the wRMSD fit with the default *c*, as shown in Fig. 4(c).

**Superposition comparison for the Topo II structures: 1bgw (gray) and 1bjt (red)**. (a) The RMSD superposition. (b) The sieve-fit superposition. (c) The global wRMSD superposition (*c* = 18Å). (d) The LMS superposition, where the maximal residual

**...**

**Example 3**. Figs. 1, 5 and 6 demonstrate the superposition results for three protein systems with large conformational displacements: RAN, Myosin and Calmodulin. In these figures, arrows highlight regions with improved fit using our method. The LMS fit takes about 0.19s, 0.58s and 0.09s, respectively. In the first protein system, the RAN structures (1byu and 1rrp) have large conformational changes, and the movement occurs in two switch regions. For the RAN structures, the final RMSD fit only captures 2 of the 200 atom pairs within 1Å, whereas the final LMS fit keeps 116 atom pairs within 1Å. In the second protein system, the Myosin structures (1b7t and 1dfk) have much larger conformational changes, with the largest movements exceeding 50Å. For the Myosin structures, the LMS fit contains 402 of the 720 atom pairs within 2Å, but there are only 30 atom pairs within this range when using the RMSD fit. The third protein system, Calmodulin, is a ubiquitous calcium-binding protein that can bind to and regulate a multitude of different protein targets. We superimpose two conformations (1cll and 1ctr) of Calmodulin, whose hinge motion involves a long helix splitting into two helices, with an angle of about 100 degrees between the axes of the two helical segments. Furthermore, as there is an additional twist around the helix axes, the total rotation of one domain relative to the other is upwards of 150 degrees. The final RMSD fit cannot detect any atom pairs within 2Å; in contrast, the final LMS fit has 69 of the 138 atom pairs within 2Å.

**Superposition comparison for the Myosin structures: 1b7t (light gray) and 1dfk (dark gray)**. (a) The RMSD superposition. (b) The LMS superposition. (c) and (d) show the secondary structures corresponding to (a) and (b). (e) and (f) are the magnified views

**...**

**Superposition comparison for the Calmodulin structures: 1cll (light gray) and 1ctr (dark gray)**. (a) The RMSD superposition. (b) The LMS superposition. Arrows denote regions with improved fit.

**Example 4**. Finally, we compare a conventional LS superposition and the LMS superposition for 30 NMR models of the second Kunitz domain of Tissue Factor Pathway Inhibitor (PDB ID: 1adz), as shown in Fig. 7. Here all conformations are superimposed on a reference structure (the first model) using the RMSD and LMS fit. In Fig. 7(a), the RMSD superposition provides misleading and inaccurate results; in contrast, the LMS superposition in Fig. 7(b) captures the similarity of the multiple conformations. The example in Fig. 7(a) was also used by Theobald et al. [8,16] to demonstrate the advantages of maximum likelihood superposition when assuming a Gaussian distribution of the whole structures in the analysis. Our LMS superposition gives a result that is nearly consistent with the maximum likelihood superposition for multiple structures.

#### Residual histogram

In this section, we use a residual histogram to show the residual distribution of atom pairs for the final LMS and RMSD fit. Fig. 8 shows the residual histograms of the five protein systems described above (ER*α*, RAN, Myosin, Calmodulin and Topo II) for the final RMSD, sieve-fit, global wRMSD, and LMS superpositions. Here a residual histogram is constructed by segmenting the length 0 – 10Å into equal sized ranges (1Å) and counting the number of atom pairs whose residuals fall within each range. The horizontal axis of the histogram denotes the ranges and the vertical axis is the number of counts. For example, in the ER*α* histogram in Fig. 8, the first "LMS fit" bar on the left denotes that there are 188 atom pairs whose residuals are within the range of 0 – 1Å for the ER*α* structures in the final LMS fit, and the second "LMS fit" bar on the left means that there are 15 atom pairs whose residuals are within the range of 1 – 2Å. In contrast, the first "RMSD fit" bar on the left denotes that there are 39 atom pairs within the range of 0 – 1Å in the final RMSD fit.
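The construction of such a histogram amounts to simple binning, as the following sketch shows. The half-open bins [k, k+1) and the choice to ignore residuals of 10Å or more are assumptions of this sketch, since the text does not specify how boundary values and larger residuals are handled; `residual_histogram` is an illustrative name.

```python
def residual_histogram(residuals, bin_width=1.0, max_range=10.0):
    """Count atom pairs per residual range [k*w, (k+1)*w) up to max_range.

    Residuals at or beyond max_range are not counted (assumption).
    """
    n_bins = int(round(max_range / bin_width))
    counts = [0] * n_bins
    for r in residuals:
        k = int(r // bin_width)
        if k in range(n_bins):
            counts[k] += 1
    return counts


# Hypothetical residuals (angstroms) of six atom pairs after a fit.
print(residual_histogram([0.2, 0.9, 1.0, 1.5, 9.99, 12.0]))
```

Applied to the residuals of two different fits of the same pair of conformations, the first entries of the returned lists correspond to the leftmost bars compared in the histograms.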

**Residual histograms of five protein systems (ER*α*, RAN, Myosin, Calmodulin, and Topo II) in the final superpositions**. Here a histogram is constructed by segmenting the distance from 0Å to 10Å into 10 equal sized ranges (each range

**...**

The LMS fit tends to fit the rigid core of two conformations and ignore the effect of the flexible regions. Therefore, the atom pairs with little movement between two conformations will have a small residual (usually within 0 – 1Å) in the LMS fit. In contrast, these atom pairs are affected by the flexible regions in the RMSD fit. Although the RMSD fit minimizes the sum of squared distances over all atom pairs, it cannot guarantee small residuals for the majority of atom pairs. In fact, the RMSD fit is only a minimization in the average sense. In the final RMSD fit, the residual of each atom pair reflects both the small movement of the core regions and the large movement of the flexible regions between the two conformations. In the examples in Fig. 8, we observe that the number of counts for the LMS fit within the range of 0 – 1Å is far larger than that for the RMSD fit. In particular, the Calmodulin histogram in Fig. 8 shows that no atom pair is within the two ranges of 0 – 1Å and 1 – 2Å in the final RMSD fit for the two conformations of Calmodulin, whereas 69 of the 138 atom pairs are within these two ranges in the final LMS fit. The wRMSD fit achieves results similar to the LMS fit (especially within 0 – 2Å), while there are few atom pairs within the range of 0 – 2Å in the final sieve-fit.

Finally, to obtain a broader overview we apply the LMS fit to a collection of known protein systems with conformational changes in MolMovDB (as of October 2007). The conformational database is classified by the size of the mobile regions into three groups: 1) motions of fragments smaller than domains, 2) domain motions, and 3) movements larger than domain movements, involving the motion of subunits. We simply call the three groups SM (small movement), MM (medium movement) and LM (large movement). There are 56, 123 and 22 protein systems available in the three groups, respectively. Of the examples shown in Table 1, ER*α* is selected from the SM group, Topo II is selected from the LM group, and the other protein systems are selected from the MM group, except Pneumolysin. In particular, the motions of RAN and Calmodulin are predominantly of hinge type, and Topo II has a complex protein motion. All protein systems have at least one pair of conformations, and animations of the conformational transition are available for most protein systems. To avoid bias from large families with multiple conformations, we retained only one pair of conformations per protein system, leading to 201 pairs of conformational structures. The same parameters (*r*_{max} = 2Å and MIN_ITERS = [*N*/2.0]) are used in all the calculations. Fig. 9 shows the average residual histograms for the protein systems in the SM, MM and LM groups, and in all three groups combined, in the final superpositions using the RMSD and LMS fit. The final LMS fit has an average of 163, 177, and 234 atom pairs within 0 – 1Å for the SM, MM, and LM groups, respectively, whereas the final RMSD fit only has an average of 141, 111 and 177 atom pairs within this range. Across all three groups, an average of 192 atom pairs is within 0 – 1Å in the final LMS fit, compared with an average of 143 atom pairs in the final RMSD fit. In the final LMS fit for the three groups, the average value of Core% is 79.7%, and the average RMSD distance in the core regions is 1.1Å.

### Multiple level superposition

It was previously shown that there is generally no unique solution for the structural fit between two proteins [2,15]. If each of two conformations consists of multiple rigid domains, our LMS fit algorithm uses the subset in the largest rigid domain to compute the superposition. The algorithm can also be extended to multiple level superposition between two protein conformations with several rigid domains. Given two conformations **X** and **Y** with several rigid domains, we present an iterative algorithm for determining the multiple level superposition of **X** and **Y** as follows.

1. First, we compute the core regions *Q*_{x }and *Q*_{y }of two conformations **X **and **Y **using the LMS fit algorithm and identify the rest of the data as outliers.

2. Next, we remove the core regions *Q*_{x} and *Q*_{y} from **X** and **Y**, and update the two conformations as **X** = **X** - *Q*_{x} and **Y** = **Y** - *Q*_{y}, respectively. Then we recompute the LMS fit between the updated **X** and **Y**.

3. Steps 1 and 2 are repeated until the user-defined superposition level is reached, where the superposition level denotes which rigid domain is finally superimposed. The final centers and rotation matrix are computed from the rigid domain at the final level.
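The peeling procedure in the steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: `fit_core` is a hypothetical callable standing in for the LMS fit, which is expected to return the positions of the core atom pairs within the arrays it receives; the function name `multilevel_cores` is likewise not taken from the LMSfit program.

```python
import numpy as np

def multilevel_cores(X, Y, fit_core, levels):
    """Return one index array per level, each holding the core found at that level.

    X, Y: (N, 3) arrays of paired coordinates.
    fit_core(X_sub, Y_sub): hypothetical callable returning positions
        (into X_sub / Y_sub) of the atom pairs identified as the rigid core.
    levels: number of superposition levels requested by the user.
    """
    remaining = np.arange(len(X))          # original indices still unassigned
    cores = []
    for _ in range(levels):
        if len(remaining) == 0:
            break
        local = np.asarray(fit_core(X[remaining], Y[remaining]), dtype=int)
        cores.append(remaining[local])     # map back to original indices
        remaining = np.delete(remaining, local)  # X = X - Q_x, Y = Y - Q_y
    return cores
```

Tracking original indices, rather than shrinking the coordinate arrays themselves, keeps each level's core directly addressable in the full structures.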

Several examples are shown in Figs. 10, 11 and 12 to demonstrate the multiple level superposition algorithm. Fig. 10 illustrates the two level superposition for the Calmodulin structures: 1cll (light gray) and 1ctr (dark gray). The first level superposition has one common large rigid domain with Core% = 51.4% in Fig. 10(a), and the second level superposition has one common small rigid domain with Core% = 46.4% in Fig. 10(b). Fig. 11 gives the four level superposition for Topo II: 1bgw (red) and 1bjt (green). Fig. 12 gives the three level superposition for GroEL: 1aon (red) and 1kp8 (green). Note that our method can capture several different rigid domains with multiple levels, where the superimposed rigid domains are highlighted in the selected regions with the solid line boundary.

**Multiple level superposition for the Calmodulin structures: 1cll (light gray) and 1ctr (dark gray)**. (a) The first level superposition (Core% = 51.4%). (b) The second level superposition (Core% = 46.4%). Note that our method can capture two rigid domains

**...**

**Multiple level superposition for Topo II: 1bgw (red) and 1bjt (green)**. (a) Level 1 (Core% = 56.4%). (b) Level 2 (Core% = 22.1%). (c) Level 3 (Core% = 11.7%). (d) Level 4 (Core% = 5.1%). Note that our method can capture different rigid domains in multiple

**...**

**Multiple level superposition for GroEL: 1aon (red) and 1kp8 (green)**. (a) Level 1 (Core% = 47.5%). (b) Level 2 (Core% = 30.5%). (c) Level 3 (Core% = 11.6%). Note that our method can capture three different rigid domains with three levels, where the superimposed

**...**

The multiple level superposition algorithm is an extension of the LMS fit. It is controlled by a single parameter 'level' and does not require specifying or choosing any residues. The local wRMSD fit can perform a similar function by sampling subsets of the protein in advance to change the initial RMSD fit [2].

## Discussion

In this section, we discuss changing the median measurement and compare similarity scores.

### Changing median measurement

If the flexible regions between two conformations are so large that the rigid core region contains less than 50% of the atoms of the protein, the LMS fit based on the minimal median assumption does not give a good superposition. Fig. 13 demonstrates this issue using the Pneumolysin structures (2bk1 and 2bk2 from CryoEM) [45]. In Fig. 13(a), the LMS fit based on the minimal median measurement is applied to the two conformations and does not produce a good superposition. The main reason is that the rigid core region contains only about 30% of the atoms of the protein. This special case is unusual, and there are few protein systems like Pneumolysin in MolMovDB.

**If the number of atom pairs in the flexible regions is larger than that in the core region, the LMS fit based on the minimal median measurement cannot obtain a good superposition**. (a) The LMS fit for the Pneumolysin system: 2bk1 (light gray) and 2bk2 (dark

**...**

When the flexible regions contain more atoms than the core region, we can simply change the "median" parameter in the LMS fit to improve the superposition. In the initial subset selection phase, the original LMS fit uses the random sampling algorithm to select *k* initial point pairs with a small value of *k*. At each iteration, 1) *k* point pairs are first selected at random between the two point sets; 2) then the median of the residuals of the remaining point pairs is computed; 3) finally, the *k* point pairs with the minimal median are selected as the initial subset for the forward search. Instead of the minimal median measurement, we may use the *m*th smallest value of the residuals of the remaining point pairs to improve the initial point pairs. In Fig. 13(b), we use the first quartile (25%) instead of the median (50%), cutting off the largest 75% of the residuals as outliers. The first quartile effectively assumes that the flexible regions contain up to 75% of the atoms of the protein. The superposition difference is highlighted in the ellipse regions with the dashed boundary.
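The effect of replacing the median by a lower order statistic can be illustrated with a short Python sketch. The helper `order_statistic` is hypothetical (it is not taken from the LMSfit program), and it picks the *m*th smallest residual with *m* = ceil(fraction × *n*), which is a simplifying assumption about how the quantile is chosen.

```python
import numpy as np

def order_statistic(residuals, fraction=0.5):
    """Return the m-th smallest residual, with m = ceil(fraction * n).

    fraction=0.5 approximates the median measurement; fraction=0.25
    corresponds to the first quartile used when the core is small.
    """
    r = np.sort(np.asarray(residuals, dtype=float))
    m = max(1, int(np.ceil(fraction * len(r))))
    return r[m - 1]

# Toy residuals: only 3 of 8 pairs (37.5%) fit well.
toy = [0.1, 0.2, 0.3, 5.0, 6.0, 7.0, 8.0, 9.0]
median_like = order_statistic(toy, 0.5)      # dominated by the flexible pairs
first_quartile = order_statistic(toy, 0.25)  # still reflects the well-fitted pairs
```

With fewer than half of the pairs in the core, the median-like value is already large for the correct rotation, so it cannot single that rotation out, whereas the first quartile can.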

### Comparison of similarity scores

One application of comparing two conformations of the same protein sequence is to evaluate a predicted protein structure against its experimentally determined target. We examine one system, Target 179 (PDB ID: 1IY9) from the CASP5 competition [46], to compare our similarity score with three others (GDT_TS, TM-score and the wRMSD scores). The GDT_TS values can be obtained from the CASP5 website http://predictioncenter.org/casp5/Casp5.html, and the TM-score [47] can be computed with TM-score online http://zhang.bioinformatics.ku.edu/TM-score/. This target has been discussed in Damm and Carlson's work [2], and the wRMSD scores (%wSUM and %wSUM_ALL) discussed here are cited directly from their paper. Similar to their strategy, we provide a Core% score based on the fit of the coordinates in the prediction (*N* in Eq. (4) equals the number of atoms in the prediction) and a Core_All% score, which corrects for any omitted coordinates (*N* in Eq. (4) equals the number of atoms in the target). If a prediction provides all *C*_{α} coordinates, Core% and Core_All% are equal. GDT_TS (Global Distance Test_Total Score) evaluates two structures based on the RMSD fit of a subset of atoms in an iterative weighted evaluation, and TM-score is an extension of GDT. The %wSUM and %wSUM_ALL scores are the averages of the weight values in the final wRMSD superposition.
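The difference between Core% and Core_All% lies only in the denominator. The sketch below assumes that Eq. (4) is the ratio of core atoms to *N* expressed as a percentage, as described in the Methods; the function name and the atom counts are hypothetical and chosen only for illustration.

```python
def core_percent(n_core, n_atoms):
    """Percentage of atoms in the core region, assuming Core% = 100 * N_core / N."""
    if n_atoms <= 0:
        raise ValueError("n_atoms must be positive")
    if not 0 <= n_core <= n_atoms:
        raise ValueError("n_core must lie between 0 and n_atoms")
    return 100.0 * n_core / n_atoms

# Hypothetical prediction that models 120 of the 150 target atoms,
# of which 90 end up in the core after the fit.
core = core_percent(90, 120)      # N = atoms in the prediction
core_all = core_percent(90, 150)  # N = atoms in the target
```

A prediction that omits coordinates is therefore penalized in Core_All% but not in Core%.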

Damm and Carlson randomly selected some good, exceptional and poor submissions from the groups for Target 179. We use the same data. Since some poor submissions are included, we choose the first quartile (25%) as the measurement parameter instead of the median (50%). Table 2 shows the rankings provided by Core_All%. The Core_All% scores match %wSUM_ALL and GDT_TS with the exception of groups 32 and 270. Damm and Carlson suggested that the poor GDT_TS rank of group 32 may be caused by a simple typographical or data processing error. In contrast, TM-score gives group 32 a top ranking, like Core_All% (group 32 is ranked first in %wSUM_ALL and second in %wSUM). Group 270 also ranks differently in %wSUM_ALL and TM-score. By superposition, we found that the prediction of group 270 is good and looks very similar to the target. The small ranking differences between the three methods may be due to the weight values. The LMS scores (Core% and Core_All%) can be considered an alternative and complementary similarity score for assessing the quality of protein conformations.

## Conclusion

We have presented a novel technique of structural superposition for flexible proteins. The method uses least median of squares (LMS) to guide the classical RMSD fit, and the forward search technique to approximate the LMS optimization. Using the method, we can automatically identify the rigid core regions and the flexible regions of proteins. The method does not require prior knowledge of the flexible regions. Our fit tool has produced successful superpositions when applied to proteins in MolMovDB for which two conformations are known. We also show that the LMS fit can be extended to multiple level superposition between two conformations with several rigid domains. This method can easily be incorporated into many RMSD overlay calculations. Note that LMS cannot substitute for LS in some cases, such as the applications of LS to molecular dynamics (MD).

## Methods

### Least median of squares (LMS) fit

To overcome the lack of robustness of the least squares fit in Eq. (1), robust methods can be used to improve the RMSD fit, for example by using weight functions to bound the influence of outliers [2]. Most existing robust methods are still based on the *least sum of squares* (also named *least squares* or LS), which cannot achieve a high breakdown point [36].

In our case, we assume that two different conformations of the same protein consist of two parts: the rigid *core* regions with high structural stability and the remaining *flexible* regions, with no overlap between them. Atoms in the core regions barely move between the two conformations. The goal of this assumption is to separate the two conformations into "good" and "bad" parts. The core regions are assumed to contain at least 50% of the points of the entire point set, so the remaining flexible regions contain up to 50% of the points. In our work, we treat the flexible regions as outliers. Our motivation is to improve the least sum of squares in the RMSD fit using a fit method with a high breakdown point (up to 50%). The *least median of squares* (LMS) is a robust statistics method that estimates the parameters of the model by minimizing the median of the absolute *residuals*. In other words, LMS replaces the sum in least squares by a *median*. The breakdown point of LMS is as high as 50% [36]. The resulting LMS estimator can resist the effect of nearly 50% contamination in the input data, which is applicable to our case. Given a rotation matrix **U**, the absolute residual is defined as the distance between the rotated point $\mathbf{x}'_{i}$ = **Ux**_{i} and the target point **y**_{i}; for the *i*th point pair the residual is *r*_{i} = ||$\mathbf{x}'_{i}$ - **y**_{i}||. Based on the given **U**, the median of absolute residuals between two point sets is defined as:

$$D_{median} = \operatorname*{med}_{i}\, \lVert \mathbf{U}\mathbf{x}_{i} - \mathbf{y}_{i} \rVert \qquad (2)$$

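In this paper, our goal is to find the rotation matrix **U** that minimizes the median *D*_{median} of the residuals as follows:

$$\min_{\mathbf{U}} D_{median} = \min_{\mathbf{U}}\, \operatorname*{med}_{i}\, \lVert \mathbf{U}\mathbf{x}_{i} - \mathbf{y}_{i} \rVert \qquad (3)$$

where **U** is the optimal rotation matrix that will be computed. Rousseeuw [36] has also pointed out that a solution for LMS always exists.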

#### Random sampling algorithm for computing the LMS optimization

Eq. (3) can be solved using the following *random sampling* algorithm (i.e. RANSAC) [35,48]. First, *k* point pairs are randomly selected between the two point sets, and a rotation matrix is computed by applying the RMSD algorithm to the *k* point pairs. Next, the median of the residuals of the remaining *N* - *k* point pairs is computed. The process is repeated *T* times to generate *T* candidate rotation matrices. The matrix with the minimal median is selected as the final rotation matrix **U**. A small value of *k* does not use all of the available points in the fit computation, while a larger value of *k* requires more iterations. If *k* is too large, the algorithm becomes sensitive to outliers, i.e. local displacements.
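The random sampling procedure can be sketched in Python with NumPy as below. This is a minimal illustration, not the LMSfit implementation: the RMSD fit is realized here with the SVD-based Kabsch solution [11,12], which is an assumption about how the rotation is obtained, and the names `kabsch`, `lms_initial_subset` and the `seed` parameter are hypothetical.

```python
import numpy as np

def kabsch(P, Q):
    """Least squares fit: centers cP, cQ and rotation U with U(p - cP) ~ (q - cQ)."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                  # 3x3 covariance matrix
    V, _, Wt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(V @ Wt))         # enforce a proper rotation
    U = Wt.T @ np.diag([1.0, 1.0, d]) @ V.T
    return cP, cQ, U

def lms_initial_subset(X, Y, k=3, T=500, seed=0):
    """Pick the k point pairs whose fit gives the minimal median residual."""
    rng = np.random.default_rng(seed)
    n = len(X)
    best_median, best_idx = np.inf, None
    for _ in range(T):
        idx = rng.choice(n, size=k, replace=False)
        cx, cy, U = kabsch(X[idx], Y[idx])
        rest = np.setdiff1d(np.arange(n), idx)  # the remaining N - k pairs
        r = np.linalg.norm((X[rest] - cx) @ U.T - (Y[rest] - cy), axis=1)
        med = np.median(r)
        if med < best_median:
            best_median, best_idx = med, np.sort(idx)
    return best_idx, best_median
```

Because only the median of the remaining residuals is scored, a candidate that fits more than half of the pairs well wins even if the other pairs are displaced by a large amount.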

### The forward search

The forward search algorithm [39] is a robust method that avoids the need to fix *k*. Recently, Fleishman et al. [35] applied this technique to fit surfaces to point clouds in computer graphics. The forward algorithm first searches for a small outlier-free subset and then iteratively refines the subset by adding one sample at a time. This is in contrast to backward algorithms, which first deal with the entire set of data points and then delete bad samples. Fleishman et al. [35] showed that fits based on backward algorithms may fail in the presence of outliers with large errors, whereas the forward algorithm gives satisfactory results. For our purpose, the initial subset is computed using the LMS algorithm with a small *k* value, where *k* is typically close to *p* for a model with *p* parameters (specifically, *p* = 3 in the 3D case) [35,39]. During the forward search, a number of parameters can be monitored to detect influential points. Atkinson et al. [39] suggested monitoring several statistics, including the residual plot, Cook's distance and others. For their purposes, these are plotted on a graph and inspected visually. Fleishman et al. [35] suggested monitoring the maximal residual *r*_{max}. These monitoring techniques essentially determine the termination conditions for the forward search iteration. In our technique, we also monitor the maximal residual, similar to the strategy of Fleishman et al. [35].

### The LMS fit algorithm

Using the forward search technique for solving Eq. (3), we present a new LMS fit algorithm for structural superposition of two point sets {**x**_{i}} and {**y**_{i}} with *N *points each in order to compute the centers and the rotation matrix **U**.

1. Compute the small outlier-free subset *Q*_{x }⊆ {**x**_{i}} and *Q*_{y }⊆ {**y**_{i}} using the LMS algorithm, which is described as *random sampling *above.

2. The centers and rotation matrix **U **are computed for *Q*_{x }and *Q*_{y }using the RMSD fit.

3. The pair of points with the minimal residual *r*_{min} among the remaining point pairs is added to *Q*_{x} and *Q*_{y}, respectively.

4. Repeat steps 2 and 3 until *r*_{min }is larger than a predefined threshold *r*_{max }and the iteration number *iter *is larger than the minimal iteration number MIN_ITERS. Finally, identify points in *Q*_{x }and *Q*_{y }as the rigid core regions and points in ({**x**_{i}} - *Q*_{x}) and ({**y**_{i}} - *Q*_{y}) as outliers or flexible regions.

Implementation details of the LMS fit are described in **Appendix**.
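The four steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the LMSfit program itself: the RMSD fit is realized with an SVD-based Kabsch solution [11,12], the outlier-free starting subset of step 1 is passed in as `init_idx`, and the names `kabsch`, `lms_fit`, `r_max` and `min_iters` are chosen for illustration.

```python
import numpy as np

def kabsch(P, Q):
    """Least squares fit: centers cP, cQ and rotation U with U(p - cP) ~ (q - cQ)."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    V, _, Wt = np.linalg.svd((P - cP).T @ (Q - cQ))
    d = np.sign(np.linalg.det(V @ Wt))          # enforce a proper rotation
    return cP, cQ, Wt.T @ np.diag([1.0, 1.0, d]) @ V.T

def lms_fit(X, Y, init_idx, r_max=2.0, min_iters=None):
    """Forward search: grow the core one pair at a time from init_idx."""
    n = len(X)
    if min_iters is None:
        min_iters = n // 2                       # MIN_ITERS = [N/2.0]
    core = [int(i) for i in init_idx]            # step 1 result (given)
    in_core = set(core)
    rest = [i for i in range(n) if i not in in_core]
    iters = 0
    while rest:
        cx, cy, U = kabsch(X[core], Y[core])     # step 2: RMSD fit on the subset
        r = np.linalg.norm((X[rest] - cx) @ U.T - (Y[rest] - cy), axis=1)
        j = int(np.argmin(r))                    # step 3: pair with minimal residual
        if r[j] > r_max and iters > min_iters:   # step 4: termination conditions
            break
        core.append(rest.pop(j))
        iters += 1
    cx, cy, U = kabsch(X[core], Y[core])         # final fit on the core only
    return np.array(sorted(core)), cx, cy, U
```

The indices returned in the first element are the rigid core; all other indices are treated as the flexible regions.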

#### Initial robust estimator

In the first step of the forward search algorithm, the initial subset is computed using the LMS algorithm with a small *k* value (we typically choose *k* = 3). If the number of atoms *N* of the protein is small, the initial subset can be chosen by exhaustive enumeration of all $\binom{N}{k}$ subsets; otherwise, LMS uses the random sampling algorithm, which requires a large number of iterations *T* to achieve a high probability of finding a good estimator. The LMS algorithm, as a statistical method, assumes that the samples (points) are independent. If *g* is the probability of selecting a single good sample at random from the two original point sets {**x**_{i}} and {**y**_{i}}, then the probability *P* of successfully finding *k* good samples after *T* iterations can be computed by *P* = 1 - (1 - *g*^{k})^{T} [35]. In our implementation, we use *T* = 500 for small proteins (e.g. *N* < 900) and *T* = 1000 for large proteins (e.g. *N* ≥ 900) in order to obtain both small errors and a short computation time.
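The success probability formula can be evaluated directly, and inverted to estimate how many iterations a target probability requires. The following Python sketch is illustrative; the function names are hypothetical, and the inversion assumes 0 < *g* < 1 and a target probability strictly between 0 and 1.

```python
import math

def success_probability(g, k, T):
    """P = 1 - (1 - g^k)^T: chance that at least one of T draws has k good samples."""
    return 1.0 - (1.0 - g ** k) ** T

def iterations_needed(g, k, target):
    """Smallest T for which success_probability(g, k, T) >= target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - g ** k))

# Worst case allowed by the LMS assumption: half of the points are good.
p_500 = success_probability(0.5, 3, 500)
t_99 = iterations_needed(0.5, 3, 0.99)
```

Under these assumptions, a few dozen iterations already suffice for *k* = 3 and *g* = 0.5, so *T* = 500 or 1000 leaves a wide safety margin for the case where good samples are rarer.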

#### Termination conditions

In the fourth step, there are two termination conditions (i.e. *r*_{min} > *r*_{max} and *iter* > MIN_ITERS). *r*_{max} is the threshold of the maximal residual, and it controls the fitted subsets. A smaller value of *r*_{max} does not use all of the available atom pairs in the fit, while a larger value of *r*_{max} requires more iterations and makes the algorithm sensitive to outliers. If *r*_{max} is so large that the final subset is equal to the input point set, i.e. no outlier is detected, the LMS fit is the same as the RMSD fit algorithm. In this sense, the RMSD fit is a special case of our algorithm. In our experiments, the errors would have to be on the order of Angstroms. We have found that *r*_{max} in the range of 1Å to 4Å is able to highlight the similarity of the rigid core regions.

In addition, the other termination condition, in which the iteration number *iter* should be larger than the minimal iteration number MIN_ITERS, is also very important. MIN_ITERS is usually chosen as a predefined integer to ensure that the core regions contain more than 50% of the atoms (generally [*N*/2.0] ≤ MIN_ITERS ≤ *N*). This constraint satisfies the LMS assumption that the core regions contain at least 50% of the points of the entire point set. We typically choose MIN_ITERS as half of the number of atoms, i.e. MIN_ITERS ⇐ [*N*/2.0].

#### A new similarity measurement

A standard RMSD fit minimizes the sum of squared residuals over all atom pairs, whereas the LMS fit minimizes the median of the residuals over all atom pairs. After finishing the LMS fit using the forward search technique, we can obtain two similarity measurements. One is the median distance *D*_{median}, obtained by computing Eq. (2) for the two entire point sets. The other is the RMSD distance *D*_{rmsd}, obtained by computing the square root of Eq. (1) for the final subsets *Q*_{x} and *Q*_{y}. In contrast to *D*_{median} and *D*_{rmsd}, which are defined by absolute distances, we present a new similarity measurement:

$$\text{Core\%} = \frac{N_{core}}{N} \times 100\% \qquad (4)$$

where *N*_{core} is the number of atoms in the core region, and *N* is the total number of atoms of the protein. The value of Core% denotes the proportion of the entire point set that belongs to the core region (i.e. the final subset). It is a more intuitive measure of the similarity between two conformations than the absolute distances *D*_{median} and *D*_{rmsd}. The maximum value of Core% occurs when *N*_{core} is equal to *N* (i.e. the distances between all atom pairs are less than *r*_{max}). The lower the similarity, the smaller the value of Core%. The value of Core% can be used directly as the similarity score between two protein structures.

We will investigate the effect of Core% with respect to *r*_{max }in the next section.

## Authors' contributions

YL generated the original idea, executed the research, and wrote the manuscript. YF participated in the research. KR supervised the project and edited the paper. All authors read and approved the final manuscript.

## Appendix: LMS implementation details

The outline of an algorithm for the LMS fit, called **LMSfit**, is given in Algorithm 1. The algorithm takes as input two point sets **X **and **Y **with *N *points each in order to compute the centers **c**_{x }and **c**_{y }and the rotation matrix **U**. This is achieved through an iterative procedure with the aid of two variables *Q*_{x }and *Q*_{y }which are the working subset of superposition between **X **and **Y**, respectively. Initially, *Q*_{x }and *Q*_{y }are computed using the LMS algorithm through selecting *k *point pairs at random with *T *iterations, as illustrated in Algorithm 2.

Algorithm 2 takes two point sets (**X** and **Y**) as input in order to produce *k* point pairs as the initial subset for the forward search (typically *k* = 3). First, a loop with *T* iterations begins. At each iteration, two subsets (*S*_{x} and *S*_{y}) with *k* points each are selected at random, and then the two centers of the subsets and a rotation matrix are computed using the standard RMSD fit. Next, the residuals of all point pairs in the remaining subsets are calculated as the distance between each rotated point and the corresponding target point. Finally, the median *r*_{median} of the residuals of the remaining subset is obtained. The subsets (*S*_{x} and *S*_{y}) with the minimal median *r*_{median} are returned as the initial subsets (*Q*_{x} and *Q*_{y}), respectively. During the iterative procedure in Algorithm 1, the cardinality of *Q*_{x} and *Q*_{y} is gradually increased by adding, at each iteration, the pair of points (**x*** and **y***) with the minimal residual. In this way, *Q*_{x} and *Q*_{y}, which are regarded as the core region, grow during the forward search. If the residuals of the remaining point pairs are larger than the threshold *r*_{max}, the procedure is terminated. Finally, the final subsets *Q*_{x} and *Q*_{y} are regarded as the core regions, the points in *Q*_{x} and *Q*_{y} are used to compute the final centers **c**_{x} and **c**_{y} and the rotation matrix **U**, and the remaining points are identified as outliers or flexible regions.

**Algorithm 1**: **LMSfit**(**X**, **Y**, **c**_{x}, **c**_{y}, **U**)

Input: **X**: the first point set with *N *points

**Y**: the second point set with *N *points

Output:

**c**_{x }and **c**_{y}: the final centers of **X **and **Y**

**U**: the rotation matrix

Local variables:

*k*: the number of random samples

*Q*_{x }and *Q*_{y}: the subsets of **X **and **Y**

*R*_{x} and *R*_{y}: the sets of the remaining points, i.e. *R*_{x} ⇐ **X** - *Q*_{x} and *R*_{y} ⇐ **Y** - *Q*_{y}

${\tilde{c}}_{\text{x}}$ and ${\tilde{c}}_{\text{y}}$: the temporary centers

$\tilde{U}$: the temporary rotation matrix

*r*_{min}: the minimal residual

*r*_{max}: the maximal residual

begin

1: *Q*_{x }⇐ ∅; *Q*_{y }⇐ ∅;

2: **LMS**(**X**, **Y**, *k*, *Q*_{x}, *Q*_{y});

3: *I *⇐ 0;

4: *R*_{x }⇐ **X **- *Q*_{x}; *R*_{y }⇐ **Y **- *Q*_{y};

5: MIN_ITERS ⇐ [*N*/2.0];

6: **while **(|*R*_{x}| > 0) **do**

7: Compute two centers ${\tilde{c}}_{\text{x}}$ and ${\tilde{c}}_{\text{y}}$ for *Q*_{x }and *Q*_{y};

8: Translate *Q*_{x} and *Q*_{y} to ${\tilde{c}}_{\text{x}}$ and ${\tilde{c}}_{\text{y}}$, and compute the rotation matrix $\tilde{U}$ for the two translated subsets using the standard RMSD fit algorithm;

9: /* Compute **r **as residuals of all pairs of points between *R*_{x }and *R*_{y }*/

10: **for** (*i* = 0; *i* < |*R*_{x}|; *i*++) **do**

11: **x**_{i }⇐ *R*_{x}(*i*) and **y**_{i }⇐ *R*_{y}(*i*);

12: **r**(*i*) ⇐ ||$\tilde{U}$(**x**_{i }- ${\tilde{c}}_{\text{x}}$) - (**y**_{i }- ${\tilde{c}}_{\text{y}}$)||;

13: **end for**

14: Get the pair of points **x* **and **y* **with the minimal residual *r*_{min }for **r**;

15: **if **(*r*_{min }> *r*_{max }and *I *> MIN_ITERS) **then**

16: **return**

17: **end if**

18: /* Update the subsets and the remaining point sets */

19: *Q*_{x }⇐ *Q*_{x }+ **x* **and *Q*_{y }⇐ *Q*_{y }+ **y***;

20: *R*_{x }⇐ *R*_{x }- **x* **and *R*_{y }⇐ *R*_{y }- **y***;

21: /* Update the centers and rotation matrix */

22: **c**_{x }⇐ ${\tilde{c}}_{\text{x}}$, **c**_{y }⇐ ${\tilde{c}}_{\text{y}}$, and **U **⇐ $\tilde{U}$;

23: *I*++;

24: **end while**

end

**Algorithm 2**: **LMS**(**X**, **Y**, *k*, *Q*_{x}, *Q*_{y})

Input:

**X**: the first point set with *N *points

**Y**: the second point set with *N *points

*k*: the number of random samples

Output:

*Q*_{x }and *Q*_{y}: the initial subsets from **X **and **Y**

Local variables:

*T*: the number of iterations

*S*_{x }and *S*_{y}: the subsets selected randomly

*R*_{x }and *R*_{y}: the set of the remaining points, i.e. *R*_{x }⇐ **X **- *S*_{x }and *R*_{y }⇐ **Y **- *S*_{y}

**r**: the vector of residuals

**c**_{x} and **c**_{y}: the centers of the subsets *S*_{x} and *S*_{y}

**U**: the rotation matrix

*r*_{median}: the median of the residuals

*r*_{min}: the minimal median of the residuals found so far

begin

1: *r*_{min }⇐ ∞;

2: **for** (*j* = 0; *j* < *T*; *j*++) **do**

3: Select randomly *k *pairs of points: *S*_{x }and *S*_{y}, with the same order from **X **and **Y**, respectively;

4: Compute two centers **c**_{x }and **c**_{y }for *S*_{x }and *S*_{y};

5: Translate *S*_{x }and *S*_{y }to **c**_{x }and **c**_{y}, and then compute the rotation matrix **U **for two translated subsets using the RMSD algorithm;

6: Compute the sets of the remaining points as: *R*_{x }⇐ **X **- *S*_{x }and *R*_{y }⇐ **Y **- *S*_{y};

7: /* Compute **r **as residuals of all pairs of points between *R*_{x }and *R*_{y }*/

8: **for** (*i* = 0; *i* < |*R*_{x}|; *i*++) **do**

9: **x**_{i }⇐ *R*_{x}(*i*) and **y**_{i }⇐ *R*_{y}(*i*);

10: **r**(*i*) ⇐ ||**U**(**x**_{i }- **c**_{x}) - (**y**_{i }- **c**_{y})||;

11: **end for**

12: Compute the median *r*_{median }by sorting **r**;

13: **if **(*r*_{median }<*r*_{min}) **then**

14: *r*_{min }⇐ *r*_{median};

15: *Q*_{x }⇐ *S*_{x }and *Q*_{y }⇐ *S*_{y};

16: **end if**

17: **end for**

end

## Acknowledgements

We would like to thank Dr. Talapady Bhat for some helpful comments during our work. This material is partly based upon work supported by the National Institutes of Health (GM-075004). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. We also acknowledge partial support from the National Science Foundation under Grant IIS No. 0535156.

## References

- Buck E, Iyengar R. Organization and functions of interacting domains for signaling by protein-protein interactions. Sci STKE. 2003:re14. [PubMed]
- Damm K, Carlson H. Gaussian-weighted RMSD superposition of proteins: A structural comparison for flexible proteins and predicted protein structures. Biophysical Journal. 2006;90:4558–4573. [PMC free article] [PubMed]
- Hilser V, Dowdy D, Oas T, Freire E. The structural distribution of cooperative interactions in proteins: Analysis of the native state ensemble. Proc Natl Acad Sci. 1998;95:9903–9908. [PMC free article] [PubMed]
- Luque I, Freire E. Structural stability of binding sites: Consequences for binding affinity and allosteric effects. Proteins. 2000;41:63–71. [PubMed]
- Pawson T, Nash P. Assembly of cell regulatory systems through protein interaction domains. Science. 2003;300:445–452. [PubMed]
- Chiang R, Meng E, Huang C, Ferrin T, Babbitt P. The structure superposition database. Nucleic Acids Research. 2003;31:505–510. [PMC free article] [PubMed]
- Flower D. Rotational superposition: A review of methods. Journal of Molecular Graphics and Modelling. 1999;17:238–244. [PubMed]
- Theobald D, Wuttke D. THESEUS: Maximum likelihood superpositioning and analysis of macromolecular structures. Bioinformatics. 2006;22:2171–2172. [PMC free article] [PubMed]
- Coutsias E, Seok C, Dill K. Using quaternions to calculate RMSD. Journal of Computational Chemistry. 2004;25:1849–1857. [PubMed]
- Horn B. Closed-form solution of absolute orientation using unit quaternions. Journal of the Optical Society of America. 1986;4:629–642.
- Kabsch W. A solution for the best rotation to relate two sets of vectors. Acta Crystallographica. 1976;32:922–923.
- Kabsch W. A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallographica. 1978;34:827–828.
- Maiorov V, Crippen G. Significance of root-mean-square deviation in comparing three-dimensional structures of globular proteins. Journal of Molecular Biology. 1994;235:625–634. [PubMed]
- Ye Y, Godzik A. Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics. 2003;19:ii246–ii255. [PubMed]
- Godzik A. The structural alignment between two proteins: Is there a unique answer? Protein Science. 1996;5:1325–1338. [PMC free article] [PubMed]
- Theobald D, Wuttke D. Empirical Bayes hierarchical models for regularizing maximum likelihood estimation in the matrix Gaussian Procrustes problem. Proc Natl Acad Sci. 2006;103:18521–18527. [PMC free article] [PubMed]
- Krebs W, Gerstein M. The morph server: a standardized system for analyzing and visualizing macromolecular motions in a database framework. Nucleic Acids Research. 2000;28:1665–1675. [PMC free article] [PubMed]
- Gerstein M, Chothia C. Analysis of protein loop closure: two types of hinges produce one motion in lactate dehydrogenase. Journal of Molecular Biology. 1991;220:133–149. [PubMed]
- Lesk AM. Protein Architecture: A Practical Guide. IRL Press, Oxford; 1991.
- Wriggers W, Schulten K. Protein domain movements: detection of rigid domains and visualization of hinges in comparisons of atomic coordinates. Proteins. 1997;29:1–14. [PubMed]
- Carugo O, Pongor S. A normalized root-mean-square distance for comparing protein three-dimensional structures. Protein Science. 2001;10:1470–1473. [PMC free article] [PubMed]
- Zhang Y, Skolnick J. TM-align: A protein structure alignment algorithm based on TM-score. Nucleic Acids Research. 2005;33:2302–2309. [PMC free article] [PubMed]
- Kavraki L. Molecular Distance Measures. 2007. http://cnx.org/content/m11608/latest/
- Wallin S, Farwer J, Bastolla U. Testing similarity measures with continuous and discrete protein models. Proteins. 2003;50:144–157. [PubMed]
- Maiti R, Van Domselaar G, Zhang H, Wishart D. SuperPose: A simple server for sophisticated structural superposition. Nucleic Acids Research. 2004;32:W590–594. [PMC free article] [PubMed]
- Diamond R. On the multiple simultaneous superposition of molecular structures by rigid body transformations. Protein Science. 1992;1:1279–1287. [PMC free article] [PubMed]
- Eidhammer I, Jonassen I, Taylor W. Structure comparison and structure pattern. Journal of Computational Biology. 2000;7:685–716. [PubMed]
- Kearsley S. An algorithm for the simultaneous superposition of a structural series. Journal of Computational Chemistry. 1990;11:1187–1192.
- Perkins T, Dean P. An exploration of a novel strategy for superposing several flexible molecules. J Comput Aided Mol Des. 1993;7:155–172. [PubMed]
- Lathrop RH. The protein threading problem with sequence amino acid interaction preferences is NP-complete. Protein Engineering. 1994;7:1059–1068. [PubMed]
- Holm L, Sander C. Protein structure comparison by alignment of distance matrices. Journal of Molecular Biology. 1993;233:123–138.
- Shindyalov I, Bourne P. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering. 1998;11:739–747.
- Shatsky M, Nussinov R, Wolfson H. Flexible protein alignment and hinge detection. Proteins. 2002;48:242–256.
- Sumathi K, Ananthalakshmi P, Roshan M, Sekar K. 3dSS: 3D structural superposition. Nucleic Acids Research. 2006;34:W128–W132.
- Fleishman S, Cohen-Or D, Silva C. Robust moving least-squares fitting with sharp features. ACM Transactions on Graphics (SIGGRAPH 2005) 2005;24:544–552.
- Rousseeuw P. Least median of squares regression. Journal of the American Statistical Association. 1984;79:871–880.
- Page R, Lindberg U, Schutt CE. Domain Motions in Actin. Journal of Molecular Biology. 1998;280:463–474.
- Choi V, Goyal N. An algorithmic approach to the identification of rigid domains in proteins. Algorithmica. 2007;48:343–362.
- Atkinson A, Riani M. Robust diagnostic regression analysis. Springer; 2000.
- Echols N, Milburn D, Gerstein M. MolMovDB: Analysis and visualization of conformational change and structural flexibility. Nucleic Acids Research. 2003;31:478–482.
- Berman H, Westbrook J, Feng Z, Gilliland G, Bhat T, Weissig H, Shindyalov I, Bourne P. The Protein Data Bank. Nucleic Acids Research. 2000;28:235–242.
- DeLano W. The PyMOL Molecular Graphics System. DeLano Scientific, San Carlos, CA; http://www.pymol.org
- Bourne P, Shindyalov I. Structure comparison and alignment. Methods Biochem Anal. 2003;44:321–337.
- Fass D, Bogden C, Berger J. Quaternary changes in topoisomerase II may direct orthogonal movement of two DNA strands. Nature Structural Biology. 1999;6:322–326.
- Tilley S, Orlova E, Gilbert R, Andrew P, Saibil H. Structural basis of pore formation by the bacterial toxin pneumolysin. Cell. 2005;121:247–256.
- Moult J, Fidelis K, Zemla A, Hubbard T. Critical assessment of methods of protein structure prediction (CASP)-Round V. Proteins. 2003;53:334–339.
- Zhang Y, Skolnick J. Scoring function for automated assessment of protein structure template quality. Proteins. 2004;57:702–710.
- Fischler MA, Bolles RC. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM. 1981;24:381–395.






