PLoS One. 2009; 4(7): e6052.
Published online Jul 8, 2009. doi:  10.1371/journal.pone.0006052
PMCID: PMC2702822

Evidence for the Concerted Evolution between Short Linear Protein Motifs and Their Flanking Regions

Berend Snel, Editor

Abstract

Background

Linear motifs are short modules of protein sequences that play a crucial role in mediating and regulating many protein–protein interactions. The function of linear motifs strongly depends on the context, e.g. functional instances mainly occur inside flexible regions that are accessible for interaction. Sometimes linear motifs appear as isolated islands of conservation in multiple sequence alignments. However, they also occur in larger blocks of sequence conservation, suggesting an active role for the neighbouring amino acids.

Results

The evolution of regions flanking 116 functional linear motif instances was studied. The conservation of the amino acid sequence and order/disorder tendency of those regions was related to presence/absence of the instance. For the majority of the analysed instances, the pairs of sequences conserving the linear motif were also observed to maintain a similar local structural tendency and/or to have higher local sequence conservation when compared to pairs of sequences where one is missing the linear motif. Furthermore, those instances have a higher chance to co–evolve with the neighbouring residues in comparison to the distant ones. Those findings are supported by examples where the regulation of the linear motif–mediated interaction has been shown to depend on the modifications (e.g. phosphorylation) at neighbouring positions or is thought to benefit from the binding versatility of disordered regions.

Conclusion

The results suggest that flanking regions are relevant for linear motif–mediated interactions, both at the structural and sequence level. More interestingly, they indicate that the prediction of linear motif instances can be enriched with contextual information by performing a sequence analysis similar to the one presented here. This can facilitate the understanding of the role of these predicted instances in determining the protein function inside the broader context of the cellular network where they arise.

Introduction

Linear motifs (LMs) are short stretches of amino acids that populate protein sequences and play fundamental roles in protein interaction networks [1]. Their lengths are typically between three and ten amino acids [2], [3]. LMs frequently show wide variation in residue conservation: some positions accept only one or a few amino acids while others do not have any preference and function as spacers [4]. These sequence features give LMs evolutionary plasticity and an important role in the evolution of cellular networks through the addition of new functionality to proteins [1].

LMs are mainly found in intrinsically unstructured regions of proteins [5]. Disordered regions allow thermodynamic control of the affinity and specificity of protein interactions. They favour transient (that is to say, low affinity) and conditional interactions, often depending on a previous modification such as phosphorylation [6]. Hence the localisation of LMs in disordered regions suits the dynamic regulation of protein networks, where a rapid but deterministic response is needed [7]. Indeed, LM–mediated interactions allow the emergence of several regulatory modes (i.e. sequential, mutually exclusive and cooperative) frequently observed in signalling, vesicular trafficking and transcription pathways [8].

The function of LMs strongly depends on the context. An instance of the KDEL motif, which is an endoplasmic reticulum retrieval signal, is likely to be functional only if present in protein sequences known to localise to the ER or Golgi apparatus. On the one hand, the context defines the natural constraints that act on LMs and therefore provides “rules” that can be applied to evaluate the reliability of a newly predicted pattern or instance. One example is the domain masking strategy, which is used to discard instances occurring in protein regions that are inaccessible for interaction, such as globular domains or coiled coils [3], [9], [10], [11].

On the other hand, the context can also give detailed information about the mode of action of LMs. The role of the local amino acid composition in determining specificity of LM interactions has been experimentally studied at the interactome level [12], [13], [14]. At the structural level, unstructured regions flanking LMs have been observed to undergo disorder to order transition upon binding [15], forming either α-helices [16] or additional β strands that join a β sheet of the partner [17]. This coincides with the observation that two thirds of LMs bind to their partners by mutual fit, meaning that they acquire a fixed structure upon binding to a well structured template [1]. Furthermore, a recent survey of 3D structures of protein–peptide complexes has estimated that neighbouring residues account for 20% of the global binding energy of peptide–mediated interactions. They are thought to improve the interaction affinity with the native partner or to impede non–native interactions [18].

The evolutionary context of LMs has also been studied and used in predictive methods. Convergent evolution of LMs is at the basis of discovery algorithms like SLiMFinder [19] and DILIMOT [20], which search for over–represented motifs in unrelated proteins with a common functional attribute. Additionally, conservation of LMs in closely and distantly related proteins has been used to improve the identification of functional instances of known LM patterns [11], [21], [22], [23]. Methods for de novo discovery have also benefited from the evolutionary signal provided by analysing patterns of conservation. SLiMFinder uses global or local sequence conservation to improve confidence in motif predictions [9], [24]; DILIMOT takes into account conservation of the motif in orthologs as part of the scoring scheme [10].

It is clear that LM predictions from the current generation of predictors require experimental validation to be considered genuine. The methods often work at the limits of signal to noise and depend on the information content of the bioinformatics databases being used for LM prediction [3], [25], [26]. Nevertheless, LM prediction methods could be valuable tools for the study of high dimensional systems such as protein signalling networks. Therefore it is necessary to move from the identification of an LM in a protein towards the prediction of the role of that instance inside the functional framework of the protein, e.g. its network of interactors.

This work addresses the study of LM context from an evolutionary point of view. Conservation patterns of regions flanking 116 functional LM instances were examined in relation to the presence/absence of the LM inside protein families. Both the sequence identity and the structural tendency of the LM context were analysed. Notwithstanding the difficulty of assessing the generality of the results, due to the fragmentary knowledge about the complete set of cellular LMs, distinct evolutionary patterns were identified. For the majority of the studied instances, conservation of the local amino acid sequence and/or the local structural tendency was found to be differentially distributed between sequence pairs with and without the motif. These findings are supported by examples where the regulation of the LM–mediated interaction has been shown to depend on the modifications at neighbouring positions or is thought to benefit from the binding versatility of disordered regions. Taken together, the results of the present study suggest that it is possible to enrich the identification of an LM instance with regulatory information by analysing the conservation pattern of its flanking regions.

Methods

Dataset

The analysis was done using the MAFFT [27] alignments of 75 protein families containing 85 protein sequences that have 116 non–redundant LM instances linked to experimental evidence in the ELM database [3]. Protein families were taken from the TreeFam4.0 database [28]. Forty percent of the families in the dataset include proteins of metazoans (vertebrates and invertebrates) and plants (A. thaliana) or yeast (S. cerevisiae and S. pombe); 42% contain vertebrate and invertebrate sequences; the remaining 18% have only vertebrate proteins.

The presence/absence of each instance was determined in the sequences belonging to the protein family by looking for the regular expression of the corresponding LM, as defined in the ELM resource [3]. Sequence pairs in the protein family were assigned to one of the following sets: the presence set ($P_{LM}$), when both sequences have a match to the regular expression in the same position as the annotated ELM instance; the absence set ($A_{LM}$), when the instance is missing in one of the sequences. Only protein sequences having a sub–sequence aligned to the region corresponding to the ELM instance were considered. This classification assumes that an LM instance is functional if it appears in a position that, according to the alignment, corresponds to that of the annotated ELM instance. Moreover, it depends on the adequacy of the ELM regular expression and might overestimate the size of the $A_{LM}$ set. Sequence pairs where the instance is absent in both sequences were not considered, since any interpretation about their differences would imply making assumptions about the gain or loss of the instances during the evolution of the protein family.
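
The pair assignment described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name classify_pairs, the toy alignment and the pattern "[KH]DEL" are hypothetical and chosen only for illustration (the real regular expressions are those defined in the ELM resource). The sketch assumes that the annotated instance occupies a known range of alignment columns and that a sequence "has" the motif when its ungapped residues in those columns match the pattern.

```python
import re
from itertools import combinations


def classify_pairs(aligned, start, end, pattern):
    """Split sequence pairs into presence and absence sets.

    aligned: dict mapping sequence name -> aligned sequence (gaps as '-').
    start, end: alignment columns (0-based, end exclusive) of the annotated instance.
    pattern: regular expression of the motif (hypothetical in this sketch).
    """
    regex = re.compile(pattern)
    has_motif = {}
    for name, seq in aligned.items():
        segment = seq[start:end].replace("-", "")
        if not segment:
            continue  # no sub-sequence aligned to the instance region: skip
        has_motif[name] = regex.fullmatch(segment) is not None

    presence, absence = [], []
    for a, b in combinations(sorted(has_motif), 2):
        if has_motif[a] and has_motif[b]:
            presence.append((a, b))   # both sequences keep the motif
        elif has_motif[a] != has_motif[b]:
            absence.append((a, b))    # exactly one sequence lacks the motif
        # pairs lacking the motif in both sequences are not considered
    return presence, absence


# Toy alignment (invented data): the instance sits in columns 5..8.
family = {
    "s1": "MKT-AKDEL",
    "s2": "MKTLAKDEL",
    "s3": "MKTLAKDEV",
    "s4": "MKTLA----",
    "s5": "MRTLARDEV",
}
presence, absence = classify_pairs(family, 5, 9, "[KH]DEL")
```

In this toy case s4 is dropped because nothing aligns to the instance region, and the pair (s3, s5) is ignored because the motif is missing in both sequences.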

To perform comparisons between LMs located in similar structural contexts, each instance was assigned to a structural class. The structural class was defined in terms of disorder/order at two levels: protein family and module, where module is defined as an independent unit within the protein sequence with globular or disorder tendency. This classification was done in a semi–automated way, using the IUPred disorder predictor [29] and the SMART module research tool [30] and averaging the results over all the homologous sequences. Proteins were classified as disordered, when more than 70% of their residues are disordered (conservative IUPred threshold of 0.4); globular, when more than 70% of the residues belong to one or more SMART globular modules; mixed, for the proteins that could not be clearly allocated to any of the previous classes. Modules were similarly defined as disordered or globular. The final dataset has instances in all of the 6 structural classes resulting from the combination of protein and module class (see Text S1 for the complete dataset).
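
As a rough illustration of the thresholds quoted above, the following Python sketch assigns a single sequence to a class. It is only a sketch under stated assumptions: the function name structural_class is hypothetical, a residue is taken as disordered when its IUPred score is strictly above 0.4, and the averaging over homologous sequences and the manual curation of the semi–automated procedure are not reproduced.

```python
def structural_class(iupred_scores, in_globular_module,
                     disorder_cutoff=0.4, fraction=0.7):
    """Classify one sequence (or module) as disordered, globular or mixed.

    iupred_scores: per-residue disorder scores (e.g. from IUPred).
    in_globular_module: per-residue booleans, True when the residue lies
        inside a globular module (e.g. a SMART hit).
    """
    n = len(iupred_scores)
    disordered = sum(s > disorder_cutoff for s in iupred_scores) / n
    globular = sum(in_globular_module) / n
    if disordered > fraction:
        return "disordered"
    if globular > fraction:
        return "globular"
    return "mixed"  # could not be clearly allocated to either class
```

For example, a sequence with 80% of its residues above the cutoff is labelled disordered, while one split evenly between a disordered tail and a globular module is labelled mixed.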

Local structure and sequence conservation metrics

Differences between sequences were studied in terms of conservation of the local structural tendency and of the amino acid sequence at both local and global level. The conservation of the local structure was calculated for each sequence pair $(A, B)$ as:

$$IUP_{diff}(A,B) = \frac{\left|IUP_A - IUP_B\right| - \sigma_{IUP}}{\sigma_{IUP}}$$

where $|x|$ indicates the absolute value of $x$; $IUP_X$ is the IUPred value averaged over the amino acids located 15 positions to the left and right of the LM in sequence $X$; $\sigma_{IUP}$ is the standard deviation of $IUP_X$ for all the sequences in the protein family. Therefore, $IUP_{diff}$ indicates whether the difference of the local tendency to disorder/order between A and B is higher or lower than the variability inside the whole protein family. Normalisation by standard deviation permits the comparison among instances belonging to different protein families, which have different IUPred variabilities. The $IUP_{diff}$ varies between −1 and infinity, with negative or small positive values indicating conservation of the local structural tendency around the LM instance.
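
The metric can be sketched in a few lines of Python. This is a hedged illustration, assuming the normalised form (|IUP_A − IUP_B| − σ)/σ, which is consistent with the stated range from −1 to infinity; the function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def flank_mean(scores, motif_start, motif_end, flank=15):
    """Average disorder score over the residues flanking the motif.

    scores: per-residue IUPred values of one sequence.
    motif_start, motif_end: motif position (0-based, end exclusive).
    """
    left = scores[max(0, motif_start - flank):motif_start]
    right = scores[motif_end:motif_end + flank]
    return mean(left + right)


def family_sigma(flank_means):
    """Standard deviation of the flank averages over the whole family
    (population estimator: an assumption of this sketch)."""
    return pstdev(flank_means)


def iup_diff(iup_a, iup_b, sigma):
    """Difference in local disorder tendency, normalised by family variability."""
    return (abs(iup_a - iup_b) - sigma) / sigma
```

Two sequences with identical flank averages give exactly −1, and a pair whose difference equals the family standard deviation gives 0. A family with no variability (σ = 0) would need separate handling.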

The protein sequence conservation between each pair $(A, B)$ was calculated as the full-length sequence identity according to the multiple sequence alignment (the global sequence identity) and as the sequence identity of the amino acids in the 15 positions flanking the LM instance on both sides (the local sequence identity).

The definition of $IUP_{diff}$ and of the local sequence identity depends on the alignment quality of the flanking regions. Acknowledging the poor performance of multiple alignment programs in disordered regions [31], those values were calculated only when the 15 residue windows surrounding the instance contained at least 75% of non–gap positions; in other words, when there was enough information to estimate average conservation values.
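
A possible reading of the local identity calculation and of the gap filter is sketched below in Python. Several details are assumptions rather than statements from the text: the function name local_identity is hypothetical, the 75% filter is applied to each sequence over the two flanking windows taken together, and identity is the fraction of flanking columns where both sequences carry the same residue.

```python
def local_identity(aln_a, aln_b, motif_start, motif_end,
                   flank=15, min_non_gap=0.75):
    """Sequence identity in the columns flanking the motif, or None when
    either sequence has too many gaps there.

    aln_a, aln_b: two rows of the same multiple alignment (gaps as '-').
    motif_start, motif_end: alignment columns of the motif (end exclusive).
    """
    cols = list(range(max(0, motif_start - flank), motif_start))
    cols += list(range(motif_end, min(len(aln_a), motif_end + flank)))
    if not cols:
        return None
    for seq in (aln_a, aln_b):
        non_gap = sum(seq[i] != "-" for i in cols)
        if non_gap / len(cols) < min_non_gap:
            return None  # not enough aligned residues to estimate conservation
    matches = sum(aln_a[i] == aln_b[i] and aln_a[i] != "-" for i in cols)
    return matches / len(cols)
```

Returning None rather than 0 keeps poorly aligned pairs out of later statistics instead of counting them as unconserved.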

Frequency profiles and correlation between $P_{LM}$ and $A_{LM}$ sets

The distribution of the $IUP_{diff}$ values as a function of the global or local sequence identity was represented as frequency profiles. Those profiles are simply two-dimensional histograms which represent the number of pairs falling in a given range of $IUP_{diff}$ and a given range of the global or local sequence identity. Counts were normalised to avoid biases due to the different sizes of the protein families. Frequency profiles were calculated for the $P_{LM}$ and $A_{LM}$ sets of each instance. Almost half of the instances (53 out of the 116) have a sufficient number of sequence pairs to allow this statistical representation.
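
A frequency profile of this kind can be built with a short Python function. The sketch below is illustrative only: the function name frequency_profile and the bin edges are hypothetical, and the normalisation (dividing each count by the number of binned pairs) is an assumption, since the text does not specify how counts were normalised.

```python
def frequency_profile(pairs, x_edges, y_edges):
    """Normalised two-dimensional histogram of sequence pairs.

    pairs: iterable of (structural difference, sequence identity) values.
    x_edges, y_edges: increasing bin edges for the two variables.
    Returns a list of rows (one per x bin) of frequencies summing to 1.
    """
    def bin_index(value, edges):
        last = len(edges) - 2
        for i in range(len(edges) - 1):
            if edges[i] <= value < edges[i + 1]:
                return i
            if i == last and value == edges[-1]:
                return i  # the upper edge belongs to the last bin
        return None  # value outside the binned range

    counts = [[0] * (len(y_edges) - 1) for _ in range(len(x_edges) - 1)]
    total = 0
    for x, y in pairs:
        i, j = bin_index(x, x_edges), bin_index(y, y_edges)
        if i is not None and j is not None:
            counts[i][j] += 1
            total += 1
    if total == 0:
        return [[0.0] * len(row) for row in counts]
    return [[c / total for c in row] for row in counts]
```

Building one such profile for the presence set and one for the absence set of an instance, on the same bin edges, gives two grids that can be compared cell by cell.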

In order to compare the $P_{LM}$ and $A_{LM}$ profiles, their correlation was estimated using the Spearman coefficient. The Spearman coefficient ranges between 1, high correlation, and −1, complete anticorrelation. In the context of the present study, a correlation of 1 would indicate that the $P_{LM}$ and $A_{LM}$ sets cover the same ranges of $IUP_{diff}$ and of global/local sequence identity. A correlation of −1 would imply that those ranges are completely disjoint and diametrically opposed (e.g. high $IUP_{diff}$ and low sequence identity for $A_{LM}$ while low $IUP_{diff}$ and high sequence identity for $P_{LM}$). Small positive or negative values indicate that the ranges of $IUP_{diff}$ and of global/local sequence identity covered by the $P_{LM}$ and $A_{LM}$ sets tend to be disjoint but not opposite.
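
The comparison of two profiles can be sketched as a rank correlation over their cells. The Python code below is a plain implementation of the Spearman coefficient (Pearson correlation of average ranks), not the authors' code; the function names are hypothetical, and flattening the two histograms cell by cell is an assumption about how the profiles were paired.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman coefficient: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one profile is constant
    return cov / (var_x * var_y) ** 0.5


def profile_correlation(profile_p, profile_a):
    """Correlate two frequency profiles built on the same bin edges."""
    flat_p = [v for row in profile_p for v in row]
    flat_a = [v for row in profile_a for v in row]
    return spearman(flat_p, flat_a)
```

Two identical profiles give a coefficient of 1, and two profiles whose cells are ranked in exactly opposite order give −1.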

Statistical coupling analysis

Positional coupling [32] between each non–wildcard position of the LM instance and each one of the residues of the module (globular or disordered) was calculated. The method could be applied to the instances located in modules whose multiple sequence alignment is diverse, such that the frequencies of amino acids at some positions are near to their mean values in all proteins, i.e. those positions are poorly conserved. Only positions in the module with coupling values that emerge from noise were considered. The noise threshold was set to two standard deviations above the mean coupling value of all the residues in the module.

Coupled positions were classified as neighbouring, when located within 15 positions on either side of the LM instance, and as distant otherwise. For the instances located towards the limits of the module, the partial window (i.e. fewer than 15 residues) was considered. In other words, the module boundaries were taken into account when defining neighbouring residues.

Assuming that the probability of coupling is equal for any residue in the protein sequence, the number of coupled positions was weighted by the total number of potentially coupled positions: 30 for the neighbouring residues and, for the distant ones, the length of the module minus the length of the instance region (15 + motif length + 15). This weighted value is defined as the frequency of coupling.
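
The thresholding and weighting steps can be sketched as follows in Python, for the coupling values of one motif position against every position of its module. The coupling values themselves would come from the statistical coupling method [32], which is not reimplemented here. The function name coupling_frequencies is hypothetical, and, as an assumption of this sketch, the weights are the actual numbers of neighbouring and distant positions inside the module, which reduce to 30 and to the module length minus (15 + motif length + 15) when both full windows fit.

```python
from statistics import mean, pstdev


def coupling_frequencies(coupling, motif_start, motif_end, flank=15, n_sd=2.0):
    """Frequency of coupling with neighbouring and with distant positions.

    coupling: coupling value of every module position with one motif position.
    motif_start, motif_end: motif position in the module (end exclusive).
    """
    # Noise threshold: n_sd standard deviations above the mean coupling.
    threshold = mean(coupling) + n_sd * pstdev(coupling)
    lo = max(0, motif_start - flank)              # clipped at module start
    hi = min(len(coupling), motif_end + flank)    # clipped at module end
    neighbours = [i for i in range(lo, hi)
                  if not motif_start <= i < motif_end]
    distant = [i for i in range(len(coupling)) if i < lo or i >= hi]

    def frequency(positions):
        if not positions:
            return 0.0
        coupled = sum(coupling[i] > threshold for i in positions)
        return coupled / len(positions)

    return frequency(neighbours), frequency(distant)
```

With a 100-residue module, a 4-residue motif and three strongly coupled positions (two in the flanks, one far away), the neighbouring frequency is 2/30 and the distant frequency is 1/66.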

Results

LM presence and the conservation of the local structural tendency

This section explores the relationship between LM presence and the conservation of the structural tendency in the regions flanking the motif. Figure 1 shows the $IUP_{diff}$ distribution for the pairs of the $P_{LM}$ and the $A_{LM}$ sets averaged over all the instances. Even though there is a non–negligible overlap between the two distributions, negative $IUP_{diff}$ values, which indicate conservation of the local structural tendency, are significantly more frequent in $P_{LM}$ than in $A_{LM}$ sequence pairs (Kolmogorov-Smirnov test: difference = 0.423, p-value < 0.00001). This difference is lost for higher $IUP_{diff}$ values.
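
The "difference" reported by a two-sample Kolmogorov-Smirnov test is the largest vertical gap between the two empirical cumulative distributions. The Python sketch below computes only that statistic, on invented example values; the function name ks_statistic is hypothetical, and the p-value calculation is left to a statistics package.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Maximum distance between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, v) / len(a)  # fraction of a at or below v
        cdf_b = bisect_right(b, v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

A value of 0 means that the two empirical distributions coincide, and a value of 1 means that they do not overlap at all.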

Figure 1
Frequency distribution of $IUP_{diff}$ for the $P_{LM}$ and $A_{LM}$ sets.

When the analysis is repeated comparing the $IUP_{diff}$ distributions of the $P_{LM}$ and $A_{LM}$ sets of each instance, inside each protein family, analogous results are obtained. For all the structural classes the mean $IUP_{diff}$ for the $P_{LM}$ set is lower than that of the $A_{LM}$ set, as shown in Table 1. Additionally, comparison of the two $IUP_{diff}$ distributions gives statistically significant differences for 57 out of 116 instances (Kolmogorov-Smirnov test: differences between 0.303 and 0.791, p-values < 0.05, see complete results in Table S1). This means that, for almost 50% of the instances, the $P_{LM}$ and $A_{LM}$ sets have different local structural tendencies that can be quantified and used to statistically differentiate between those sequence pair sets.

Table 1
$IUP_{diff}$ ranges and mean $IUP_{diff}$ for the $P_{LM}$ and $A_{LM}$ sets per structural class.

For the remaining instances the $P_{LM}$ and $A_{LM}$ sets have the same $IUP_{diff}$ ranges. These instances suggest that, sometimes, the local structure is conserved even if the LM is lost. This is not surprising considering that the LM is a module evolving inside a higher order unit (e.g. the protein sequence) composed of several other functional modules. Disambiguation of the selective pressure imposed by the LM, based exclusively on its local structure conservation, will be difficult in these cases. Consequently it is worth analysing the conservation of the local structural tendency in relation to the evolution of the rest of the protein modules.

LM evolution and the relationship between local structural tendency and sequence conservation

In order to explore how the conservation of the local structure, in terms of disorder/order, is related to the evolution of the protein sequence, the distribution of $IUP_{diff}$ was analysed as a function of the global and local sequence conservation. Frequency profiles of the combined distribution of $IUP_{diff}$ versus the local and global sequence identity were calculated for both the $P_{LM}$ and $A_{LM}$ sets of each instance.

Figure 2 presents the frequency profiles of $IUP_{diff}$ versus the global and local sequence identity. Since they represent the distribution of the above variables for the $P_{LM}$ and $A_{LM}$ sets averaged over all the instances, those profiles do not allow a comparative analysis between $P_{LM}$ and $A_{LM}$ sets or sequence conservation variables. Differences among protein families due to dissimilar evolutionary rates are not averaged out. The structural composition of proteins belonging to different structural classes (disordered, globular, mixed) might add further disparity, since sequences with long disordered regions tend to have heterogeneous evolutionary rates [33].

Figure 2
Frequency profiles for the $P_{LM}$ and $A_{LM}$ sets.

Nevertheless, those profiles provide an idea about the general trends of the relationship between $IUP_{diff}$ and sequence conservation. As expected, the $A_{LM}$ sets cover mainly low sequence conservation values (Figure 2B and D). Indeed, even if low sequence similarity does not necessarily imply the loss of the LM, closely related protein sequences are more likely to have similar LM instances than distantly related or paralogous sequences [1], [4]. In contrast, the frequency profiles of the $P_{LM}$ sets exhibit an additional feature: low $IUP_{diff}$ values are frequent at both high and low sequence conservation values (Figure 2A and C). In other words, conservation of the amino acid sequence is not required for the maintenance of the disorder tendency around the LM.

The above result suggests that structural conservation and sequence conservation, the latter understood as sequence identity, are not redundant and that both might provide information about LM evolution. Indeed, the IUPred method predicts disordered/ordered regions by estimating the total pairwise inter-residue interaction energy [29], and therefore there is no a priori reason why the conservation of the local structural tendency should imply the conservation of the exact amino acid sequence. To further explore this, the frequency profiles of the $P_{LM}$ and $A_{LM}$ sets of each instance were obtained and their Spearman correlation coefficient calculated separately. The analysis per instance has the additional advantage of discarding artificial differences between $P_{LM}$ and $A_{LM}$ caused by dissimilar evolutionary rates among the protein families.

All the structural classes have low mean correlation coefficients, indicating that, on average, the $P_{LM}$ and $A_{LM}$ frequency profiles of each instance can be discriminated; depending on the structural class, correlation values range from 0.11 to 0.34 and from 0.02 to 0.22 for the two sequence conservation variables (see Table S2). The low number of instances per structural class makes any comparative statistical analysis unreliable, e.g. between structural classes or conservation variables. Nevertheless, on closer inspection of the results per instance (Table 2), three groups with distinct behaviour can be identified. Examples of instances belonging to each one of those groups are presented in Figure 3. Those trends do not change when the $P_{LM}$ set is enlarged by considering subsequences that partially match the ELM regular expression as LM instances (see Table S3 for further details).

Figure 3
Examples of evolutionary patterns of the regions flanking LM.
Table 2
Spearman correlation coefficient between the $P_{LM}$ and $A_{LM}$ frequency profiles.

The first group consists of those instances whose $P_{LM}$ and $A_{LM}$ frequency profiles of $IUP_{diff}$ versus the local sequence identity are less correlated than the corresponding profiles of $IUP_{diff}$ versus the global sequence identity (Figure 3A). This indicates that variations in the local protein sequence are more connected to the LM presence/absence than the modifications happening in the rest of the protein. Thirty-seven percent of the instances in Table 2 behave in this way, especially those located in disordered modules of disordered proteins (8 out of 13).

The second group is formed of instances where the contrary is true, meaning that the LM presence/absence is better distinguished by the global conservation (Figure 3B). In those cases, the main selective pressure on the LM presence might be coming from the protein sequence as a whole unit. Not surprisingly all of the 8 instances located in globular proteins (both in disordered and globular modules) belong to this group.

A third group of instances appears when merging the results of the previous section, that is to say, considering those instances whose PLM and ALM sets have significantly different IUPdiff distributions (in bold in Table 2, Figure 3C and D). In these cases, the presence or absence of the LM is correlated with changes in both the local structural tendency and the sequence conservation. Those instances reach, on average, lower correlation values regardless of the conservation variable (0.18 for locCons and 0.15 for globCons) than the instances with no significant IUPdiff distinction between PLM and ALM (0.30 for locCons and 0.26 for globCons). This last group of instances is the best evidence in favour of the hypothesis proposed above, about the additive value of the structural and sequence conservation information in the analysis of LM evolution.
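
The supporting tables describe these correlations as Spearman coefficients between the PLM and ALM frequency profiles. As a rough illustration only, the sketch below computes a Spearman rank correlation for two profiles and picks the conservation variable with the lowest correlation. The function names and the toy profiles are hypothetical, and the way the real profiles were binned is not reproduced here.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    return pearson(average_ranks(x), average_ranks(y))


def best_descriptor(loc_corr, glob_corr):
    """Name the variable with the lowest PLM/ALM correlation.

    Ties are resolved in favour of globCons, an arbitrary choice.
    """
    return "locCons" if loc_corr < glob_corr else "globCons"


# Toy profiles (invented): PLM pairs concentrate in the first bins,
# ALM pairs in the last ones, so the two profiles are anti-correlated.
plm_profile = [0.40, 0.30, 0.15, 0.10, 0.05]
alm_profile = [0.05, 0.10, 0.15, 0.30, 0.40]
toy_corr = spearman(plm_profile, alm_profile)
```

A low or negative coefficient means that the PLM and ALM profiles differ, which is the situation the text interprets as the variable distinguishing LM presence from absence.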

Co-evolution of LMs and their flanking regions

To obtain additional evidence for the co–evolution between LMs and their flanking regions, statistical coupling analysis (SCA) [32] was used as an independent method. This method has been used to identify clusters of positions that statistically co–vary with one another and therefore are likely to co–evolve and to be functionally related [34]. In this case, only pairwise coupling between the non–wildcard positions of the LM instance and all the other residues in the corresponding module was considered. The frequency of coupling with neighbouring and distant residues was calculated and compared in terms of the sequence conservation that best describes the LM evolution, that is to say the variable that gives the lowest correlation in Table 2.

For the instances that have lower locCons correlation (e.g. Figure 3A), the frequency of neighbouring coupling is significantly higher (Kolmogorov-Smirnov test: difference = 0.576, p-value < 0.005) than the frequency of distant coupling (Figure 4A). In other words, the instances whose evolution is better described by the local sequence conservation combined with the IUPdiff have a higher chance of correlated amino acid changes with neighbouring rather than with distant residues in the module. Conversely, for the instances where the global sequence conservation is the better descriptor (e.g. Figure 3B), the coupling between non–wildcard positions and neighbouring or distant positions is equally frequent (Figure 4B).
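
The "difference" reported for the Kolmogorov-Smirnov test is the maximum gap between the cumulative distributions of the two samples. A minimal sketch of that statistic follows, assuming the inputs are two plain lists of coupling frequencies. The function name and the toy values are invented for illustration, and no p-value is computed; a statistics library such as SciPy (`scipy.stats.ks_2samp`) could be used for that.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute difference between the empirical
    cumulative distribution functions of the two samples.
    """
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to evaluate the gap at every value present in either sample.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Toy coupling frequencies (invented, not taken from Figure 4).
neighbouring = [0.6, 0.7, 0.8, 0.9]
distant = [0.1, 0.2, 0.3, 0.4]
toy_difference = ks_statistic(neighbouring, distant)
```

With fully separated samples, as in the toy data, the statistic reaches its maximum of 1; identical samples give 0.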

Figure 4
Frequency of coupling between LM and neighbouring or distant residues.

Discussion

This study presents evidence for the concerted evolution of LMs and their flanking regions. Although the current knowledge of the complete set of cellular LMs is fragmentary and it is not possible to assess how representative the analysed dataset is, there are clear trends that are worth considering. LMs are known to be evolutionarily labile modules, which can be easily lost by point mutation [4]. Nonetheless, the results of the present study show that LMs, in some cases, determine the conservation of the structural tendency and/or the sequence of the neighbouring amino acids. Here those findings are discussed in the light of the protein interactions mediated by LMs.

In the first section of the Results it was shown that, for some instances, the conservation of the LM is associated with the maintenance of the structural tendency of the surrounding residues. What is the meaning of this conservation? As mentioned in the Introduction, two thirds of the LM–mediated interactions lead to the formation of secondary structure elements (α–helices or β–strands) [1]. If the LM functionality is to be maintained, the structural properties of the neighbouring amino acids that allow such disorder/order transition are likely to be conserved. This local propensity would be reflected by the corresponding IUPred values and hence the low IUPdiff observed in the PLM sets would indicate the conservation of such propensity.

However, the conservation of the local structural tendency could also indicate the maintenance of the local disorder. Several studies on protein–protein interactions have drawn attention to the importance of intrinsic disorder in the formation of protein complexes [6], [35], [36], [37]. If the local disorder provides the flexibility required to bind different patterns, it is not surprising to observe the conservation of this structural tendency in the regions involved in such interactions. Previous work [38] has connected the conservation of predicted disordered regions in eukaryotic proteins with DNA/RNA binding domains. The conservation of disorder around LMs would extend this result to a broader set of biological processes.

The instances of the molecular hub p53 exemplify the double meaning of the structural conservation measured by the IUPdiff. For three out of four of the p53 instances in the dataset (TRG_NES_CRM1_1, 339–352; MOD_SUMO, 385–388; MOD_PIKK_1, 12–18), the presence of the instance coincides with the conservation of the local structural tendency. They belong to the group of instances that have a significantly different distribution of the IUPdiff between PLM and ALM sets (p-value < 0.05). Those instances are located in the C- and N-terminal regions of p53, which are disordered modules known to bind different partners by acquiring different conformations [39]. Additionally, the MOD_SUMO and the MOD_PIKK_1 (but not the TRG_NES_CRM1_1) occur in predicted α–MoREs, disordered regions having propensities to form α–helix upon molecular recognition [16].

A more detailed study of the structural conservation as a function of the different types of mutual fit interaction (i.e. α–helix formation, β augmentation or irregular topology) may be interesting. It would shed light on the specific requirements of each conformation. This would require the definition of a more elaborate metric for the local structure conservation than the IUPdiff. However, independently of its specific meaning, the structural tendency conservation around the LM suggests the occurrence of overlapping interaction surfaces. Those clustered overlaps are likely to entail different regulatory mechanisms for the spatial or temporal isolation of the mutually exclusive interactions.

In the second and third parts of the Results it was shown that the presence of some LM instances is accompanied by the conservation of the amino acids flanking the motif. This is the case for 42% of the instances in Table 2 that have locCons correlation values lower than 0.20 between the PLM and ALM sets. The local sequence conservation could be explained in some cases by the conservation of the local structural tendency (instances in bold in Table 2, Figure 3C and D). Still, as shown in the Results (Figure 2), sequence identity does not seem to be a requirement for the maintenance of the local order/disorder tendency. Indeed, it has been recently demonstrated by nuclear magnetic resonance spectroscopy that intrinsically disordered regions can maintain their dynamic behaviour despite low sequence similarity [40]. Yet there must be a functional meaning for the local sequence conservation associated with these instances, especially considering that it discriminates sequences with and without the motif (PLM and ALM sets), even when local structural tendencies between those sequences are not significantly different (e.g. Figure 3A and B). Furthermore, these instances have a higher chance of co–evolving with the neighbouring residues than with the distant ones (Figure 4A).

It is likely that the flanking regions of those instances are related to the regulation of the LM or to the regulation of another interaction that is functionally connected to the one mediated by the motif. This is the case of the LIG_AP2alpha_1 in positions 324–328 of amphiphysin (P49418, locCons correlation 0.03), which is involved in clathrin coated vesicle formation. Phosphorylation of amphiphysin by Cdk5 at S276, S285 and T310 has been shown to directly regulate the intramolecular interaction in amphiphysin, which in turn regulates dynamin-dependent endocytosis [41], [42]. Likewise, other instances with locCons correlation between −0.05 and 0.16 (LIG_SH3_1 P10636 565–572, LIG_COP1 P17535 241–248) have experimentally verified phosphorylation sites in their flanking regions: T561 for P10636 and S251, S255 and S259 for P17535 [25]. Those phosphorylation sites are likely to regulate the local protein conformation and activity, as recently shown in a phosphoproteomic analysis of the mouse brain cytosol [43].

Finally, it is opportune to consider how current LM prediction methods can benefit from these results. A simple sequence analysis, similar to the one described here, would allow the identification of flanking regions with relevant conservation patterns, adding contextual information to already predicted LM instances. This can lead to a more detailed understanding of the role of LMs in determining the protein function. Indeed we consider that the LM field is ready – and has the potential – to go one step further from the timeless binary interactions towards the construction of more dynamic and realistic protein networks.
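
As a hedged illustration of what such a simple sequence analysis could look like, the sketch below compares the identity of two aligned sequences in the residues flanking a motif with their identity over all non-motif columns. Pairwise identity is used here only as a stand-in for the local and global conservation measures; it is an assumption, not the definition of locCons or globCons used in this study. The function names, the flank width of 5 and the toy sequences are likewise hypothetical.

```python
def pairwise_identity(seq_a, seq_b, columns):
    """Fraction of identical residues over the given alignment columns.

    Columns where both sequences have a gap are skipped.
    """
    compared = identical = 0
    for i in columns:
        a, b = seq_a[i], seq_b[i]
        if a == "-" and b == "-":
            continue
        compared += 1
        if a == b:
            identical += 1
    return identical / compared if compared else 0.0


def flank_and_global_identity(seq_a, seq_b, motif_start, motif_end, flank=5):
    """Compare identity near the motif with identity outside the motif.

    motif_start and motif_end are 0-based alignment columns, end-exclusive.
    Returns (identity in the flanks, identity over all non-motif columns).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    n = len(seq_a)
    left = range(max(0, motif_start - flank), motif_start)
    right = range(motif_end, min(n, motif_end + flank))
    flank_cols = list(left) + list(right)
    other_cols = [i for i in range(n) if not motif_start <= i < motif_end]
    return (pairwise_identity(seq_a, seq_b, flank_cols),
            pairwise_identity(seq_a, seq_b, other_cols))


# Toy aligned pair (invented): both keep the motif "VKQE" at columns 10-13
# and share most of the flanks, while the distant residues differ.
seq_a = "GGGGGLPSPTVKQEDSDEEGGGGG"
seq_b = "AAAAALPSPTVKQEDSDEQTTTTT"
flank_id, global_id = flank_and_global_identity(seq_a, seq_b, 10, 14)
```

In the toy pair the flanks are far more similar than the sequences as a whole, the kind of pattern that would flag a flanking region as potentially relevant to a predicted instance.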

Supporting Information

Text S1

Dataset of functional instances. List of the 116 instances, classified per structural class with phylogeny, sequence and motif identifiers.

(0.00 MB TXT)

Table S1

Comparison of the IUPdiff distribution between the PLM and ALM sets. Kolmogorov-Smirnov test comparing the IUPdiff distribution of the PLM and ALM sets of each instance. The difference is the Kolmogorov-Smirnov statistic calculated from the cumulative distributions of the compared samples.

(0.03 MB PDF)

Table S2

Mean and standard deviation of the correlation between PLM and ALM frequency profiles. Spearman correlation coefficient calculated between the PLM and ALM frequency profiles of each instance. Correlation of the frequency profiles of IUPdiff versus locCons and IUPdiff versus globCons are indicated as locCons corr and globCons corr respectively.

(0.02 MB PDF)

Table S3

Effect of the stringency of the regular expression matching on the correlation between the PLM and ALM frequency profiles. Spearman correlation coefficient calculated between the PLM and ALM frequency profiles of each instance. Correlation of the frequency profiles of IUPdiff versus locCons and IUPdiff versus globCons are indicated as locCons corr and globCons corr respectively. Percentages indicate the stringency used to define a match to the ELM regular expression: 100% stringency means that an LM is considered present only if there is a perfect match to the ELM regular expression in the same position as the annotated instance; lower percentages mean that an LM is also considered present in case of a partial match to the regular expression. Correlation values in bold show the biggest difference (more than 0.05) with the corresponding 100% stringency correlation value. Missing values could not be calculated due to an insufficient number of sequence pairs in the ALM set.

(0.05 MB PDF)

Acknowledgments

The authors would like to thank Steve W. Lockless and Rama Ranganathan for providing the code for the SCA implementation, Aidan Budd and Daniel Castaño for fruitful discussion at the beginning of the project, Niall Haslam for critical reading of the manuscript and Norman Davey for pointing out at “flanking”.

Footnotes

Competing Interests: The authors have declared that no competing interests exist.

Funding: This work was partially supported by the EU EMBRACE (LHSG-CT-2004-512091) grant. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Diella F, Haslam N, Chica C, Budd A, Michael S, et al. Understanding eukaryotic linear motifs and their role in cell signaling and regulation. Front Biosci. 2008;13:6580–603.
2. Sigrist C, Cerutti L, Hulo N, Gattiker A, Falquet L, et al. PROSITE: a documented database using patterns and profiles as motif descriptors. Brief Bioinform. 2002;3(3):265–74.
3. Puntervoll P, Linding RC, Chabanis-Davidson GS, Mattingsdal M, et al. ELM server: A new resource for investigating short functional sites in modular eukaryotic proteins. Nucleic Acids Res. 2003;31(13):3625–30.
4. Neduva V, Russell R. Linear motifs: evolutionary interaction switches. FEBS Lett. 2005;579(15):3342–3345.
5. Fuxreiter M, Tompa P, Simon I. Local structural disorder imparts plasticity on linear motifs. Bioinformatics. 2007;23(8):950–6.
6. Wright P, Dyson H. Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J Mol Biol. 1999;293(2):321–31.
7. Gibson T. Cell regulation: determined to signal discrete cooperation. Trends Biochem Sci. 2009 (in press)
8. Seet B, Dikic I, Zhou M, Pawson T. Reading protein modifications with interaction domains. Nat Rev Mol Cell Biol. 2006;7(7):473–83.
9. Davey N, Shields D, Edwards R. SLiMDisc: short, linear motif discovery, correcting for common evolutionary descent. Nucleic Acids Res. 2006;34(12):3546–54.
10. Neduva V, Linding R, Su-Angrand I, Stark A, de Masi F, et al. Systematic discovery of new recognition peptides mediating protein interaction networks. PLoS Biol. 2006;3(12):e405.
11. Dinkel H, Sticht H. A computational strategy for the prediction of functional linear peptide motifs in proteins. Bioinformatics. 2007;23(24):3297–303.
12. Landgraf C, Panni S, Montecchi-Palazzi L, Castagnoli L, Schneider-Mergener J, et al. Protein interaction networks by proteome peptide scanning. PLoS Biol. 2004;2(1):e14.
13. Stiffler M, Chen J, Grantcharova V, Lei Y, Fuchs D, et al. PDZ domain binding selectivity is optimized across the mouse proteome. Science. 2007;317(5836):364–9.
14. Zarrinpar A, Park S, Lim W. Optimization of specificity in a cellular protein interaction network by negative selection. Nature. 2003;426(6967):676–80.
15. Mohan A, Oldfield C, Radivojac P, Vacic V, Cortese M, et al. Analysis of molecular recognition features (MoRFs). J Mol Biol. 2006;362(5):1043–59.
16. Oldfield C, Cheng Y, Cortese M, Romero P, Uversky V, et al. Coupled folding and binding with alpha-helix-forming molecular recognition elements. Biochemistry. 2005;44(37):12454–70.
17. Remaut H, Waksman G. Protein–protein interaction through beta–strand addition. Trends Biochem Sci. 2006;31:436–444.
18. Stein A, Aloy P. Contextual specificity in peptide-mediated protein interactions. PLoS ONE. 2008;3(7):e2524.
19. Edwards R, Davey N, Shields D. SLiMFinder: A probabilistic method for identifying over-represented, convergently evolved, short linear motifs in proteins. PLoS ONE. 2007;2(10):e967.
20. Neduva V, Russell R. DILIMOT: discovery of linear motifs in proteins. Nucleic Acids Res. 2006;34(Web Server issue):W350–5.
21. Chica C, Labarga A, Gould C, López R, Gibson T. A tree-based conservation scoring method for short linear motifs in multiple alignments of protein sequences. BMC Bioinformatics. 2008;9:229.
22. Balla S, Thapar V, Verma S, Luong T, Faghri T, et al. Minimotif Miner: a tool for investigating protein function. Nat Methods. 2006;3(3):175–7.
23. Gutman R, Berezin C, Wollman R, Rosenberg Y, Ben-Tal N. QuasiMotiFinder: protein annotation by searching for evolutionarily conserved motif-like patterns. Nucleic Acids Res. 2005;33(Web Server issue):W255–61.
24. Davey N, Shields D, Edwards R. Masking residues using context-specific evolutionary conservation significantly improves short linear motif discovery. Bioinformatics. 2009;25(4):443–450.
25. Diella F, Gould C, Chica C, Via A, Gibson T. Phospho.ELM: a database of phosphorylation sites–update 2008. Nucleic Acids Res. 2008;36:D240–4.
26. Obenauer J, Cantley L, Yaffe M. Scansite 2.0: Proteome-wide prediction of cell signaling interactions using short sequence motifs. Nucleic Acids Res. 2003;31(13):3635–41.
27. Katoh K, Misawa K, Kuma K, Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002;30(14):3059–66.
28. Li H, Coghlan A, Ruan J, Coin L, Hériché J, et al. TreeFam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Res. 2006;34:D572–80.
29. Dosztányi Z, Csizmók V, Tompa P, Simon I. IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics. 2005;21(16):3433–4.
30. Letunic I, Doerks T, Bork P. SMART 6: recent updates and new developments. Nucleic Acids Res. 2009;37:D229–32.
31. Perrodou E, Chica C, Poch O, Gibson T, Thompson J. A new protein linear motif benchmark for multiple sequence alignment software. BMC Bioinformatics. 2008;9:213.
32. Lockless S, Ranganathan R. Evolutionarily conserved pathways of energetic connectivity in protein families. Science. 1999;286(5438):295–9.
33. Brown C, Takayama S, Campen A, Vise P, Marshall T, et al. Evolutionary rate heterogeneity in proteins with long disordered regions. J Mol Evol. 2002;55(1):104–10.
34. Lockless S, Zhou M, MacKinnon R. Structural and thermodynamic properties of selective ion binding in a K+ channel. PLoS Biol. 2007;5(5):e121.
35. Dyson H, Wright P. Intrinsically unstructured proteins and their functions. Nat Rev Mol Cell Biol. 2005;6(3):197–208.
36. Tompa P, Fuxreiter M. Fuzzy complexes: polymorphism and structural disorder in protein-protein interactions. Trends Biochem Sci. 2008;33(1):2–8.
37. Hegyi H, Schad E, Tompa P. Structural disorder promotes assembly of protein complexes. BMC Struct Biol. 2007;7:65.
38. Chen J, Romero P, Uversky V, Dunker A. Conservation of intrinsic disorder in protein domains and families: II. functions of conserved disorder. J Proteome Res. 2006;5(4):888–98.
39. Uversky V, Oldfield C, Dunker A. Showing your ID: intrinsic disorder as an ID for recognition, regulation and cell signaling. J Mol Recognit. 2005;18(5):343–84.
40. Daughdrill G, Narayanaswami P, Gilmore S, Belczyk A, Brown C. Dynamic behaviour of an intrinsically unstructured linker domain is conserved in the face of negligible amino acid sequence conservation. J Mol Evol. 2007;65(3):277–88.
41. Takei K, Yoshida Y, Yamada H. Regulatory mechanisms of dynamin–dependent endocytosis. J Biochem. 2005;137(3):243–7.
42. Tomizawa K, Sunada S, Lu Y, Oda Y, Kinuta M, et al. Cophosphorylation of amphiphysin I and dynamin I by Cdk5 regulates clathrin-mediated endocytosis of synaptic vesicles. J Cell Biol. 2003;163(4):813–24.
43. Collins M, Yu L, Campuzano I, Grant S, Choudhary J. Phosphoproteomic analysis of the mouse brain cytosol reveals a predominance of protein phosphorylation in regions of intrinsic sequence disorder. Mol Cell Proteomics. 2008;7(7):1331–48.

Articles from PLoS ONE are provided here courtesy of Public Library of Science