Proc Natl Acad Sci U S A. 2008 May 20; 105(20): 7177–7181.
Published online 2008 May 13. doi:  10.1073/pnas.0711151105
PMCID: PMC2438223

Prediction of membrane-protein topology from first principles


The current best membrane-protein topology-prediction methods are typically based on sequence statistics and contain hundreds of parameters that are optimized on known topologies of membrane proteins. However, because the insertion of transmembrane helices into the membrane is the outcome of molecular interactions among protein, lipids and water, it should be possible to predict topology by methods based directly on physical data, as proposed >20 years ago by Kyte and Doolittle. Here, we present two simple topology-prediction methods using a recently published experimental scale of position-specific amino acid contributions to the free energy of membrane insertion that perform on a par with the current best statistics-based topology predictors. This result suggests that prediction of membrane-protein topology and structure directly from first principles is an attainable goal, given the recently improved understanding of peptide recognition by the translocon.

Keywords: bioinformatics, membrane insertion, topology prediction, translocon, biological hydrophobicity scale

Prediction of membrane-protein topology is a classic problem in bioinformatics (1). The very first prediction algorithms were based solely on hydrophobicity plots (2), but these early methods performed poorly in practice and were soon supplanted by machine-learning methods that extract statistical sequence preferences from databases of experimentally mapped topologies. Today, the best performing methods have been trained on extensive datasets and contain hundreds of free parameters that are optimized during the training session (3, 4). With the inclusion of information from aligned homologous sequences, one can expect modern methods to predict the correct topology for up to 80% of all multispanning membrane proteins (5, 6).

Yet, from a basic science point of view, it is somewhat unsatisfying that the best methods use sequence statistics rather than physicochemical principles as the underlying basis for the prediction. After all, the cellular machineries (translocons) responsible for membrane-protein biogenesis do not have access to statistical data but rather exploit molecular interactions (lipid–protein, water–protein, and protein–protein) to ensure that membrane proteins attain their correct topology (7, 8). In principle, therefore, it should be possible to match the performance of current machine-learning predictors by using methods based on the same physical properties that determine translocon-mediated membrane insertion.

Despite years of biophysical studies of protein–lipid interactions (9–12), the first comprehensive dataset describing the insertion of transmembrane (TM) α-helices into the endoplasmic reticulum (ER) membrane in terms of free-energy contributions from individual amino acids in different positions along the membrane normal was published only recently (13). Here, we show that a simple additive free-energy model derived from these experimental data, when coupled with the “positive-inside” rule (14), predicts the topology of helix-bundle membrane proteins with performance levels rivaling the best machine-learning methods. This represents an important step in our ability to predict protein structure from sequence by using physical principles and further indicates that insertion free energies measured in the mammalian ER membrane are, to a first approximation, generally applicable for TM helices in membrane proteins from a wide variety of organisms and organelles.


Experimental Data.

The experimental data underlying the new topology-prediction algorithms described below are from a recent study of the energetics of insertion into the ER membrane of a single TM segment embedded within a larger protein (13). In that work, position-dependent contributions to the overall apparent free energy of membrane insertion (ΔGapp) for each of the 20 naturally occurring amino acids were derived from insertion experiments in which a given kind of residue was “scanned” across a model TM segment composed of Ala and Leu residues. Simple Gaussian functions were then fitted to the raw data points, producing free-energy profiles describing the contribution of each amino acid to ΔGapp. In the same study, the dependence of ΔGapp on the total length of the TM segment was also determined. Finally, it was shown that measured ΔGapp values for >400 natural and designed TM segments were well approximated by the following expression:

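$$\Delta G_{app}^{pred} = \sum_{i=1}^{l} \Delta G_{app}^{aa(i)} + c_0 \sqrt{\left(\sum_{i=1}^{l} \Delta G_{app}^{aa(i)} \sin(100^{\circ} i)\right)^{2} + \left(\sum_{i=1}^{l} \Delta G_{app}^{aa(i)} \cos(100^{\circ} i)\right)^{2}} + c_1 + c_2 l + c_3 l^{2} \qquad [1]$$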

where l is the length of the segment and ΔGappaa(i) is the contribution (in kcal/mol) from amino acid aa in position i. The expression under the square root is the hydrophobic moment (15) of the segment, and the last three terms model the length-dependence. Eq. 1 provides the basic physical model used in the new topology predictors. All parameter values in Eq. 1 have been derived directly from experimental data (13) and were not optimized on the proteins in the benchmark sets (see below).
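As an illustration of how an expression of this form can be evaluated, the following Python sketch sums per-position contributions, adds a hydrophobic-moment term, and adds a quadratic length correction. It is a sketch under stated assumptions only: the coefficients `C0` to `C3`, the toy `contribution` function, and the 100° rotation per residue (the value for an ideal α-helix) are placeholders chosen for illustration and are not the experimentally derived values of ref. 13.

```python
import math

# Hypothetical coefficients: placeholders, NOT the fitted values from ref. 13.
C0, C1, C2, C3 = 0.3, 9.0, -0.6, 0.008

def contribution(aa, rel_pos):
    """Toy position-dependent contribution (kcal/mol) of residue `aa`.

    rel_pos runs from -1 (one end of the segment) to +1 (the other end).
    The real profiles are Gaussian fits to experimental data (ref. 13);
    the numbers used here are invented for illustration only.
    """
    base = {"L": -0.6, "A": 0.1, "K": 2.7, "D": 3.5}.get(aa, 0.5)
    # Unfavourable residues cost most in the centre, less near the ends.
    return base * math.exp(-rel_pos ** 2) if base > 0 else base

def delta_g_app(segment):
    """Evaluate an Eq. 1-style additive model for one candidate segment."""
    l = len(segment)
    terms = []
    for i, aa in enumerate(segment):
        rel = 2.0 * i / (l - 1) - 1.0 if l > 1 else 0.0
        terms.append(contribution(aa, rel))
    total = sum(terms)
    # Hydrophobic moment, assuming 100 degrees of rotation per residue.
    s = sum(g * math.sin(math.radians(100 * i)) for i, g in enumerate(terms))
    c = sum(g * math.cos(math.radians(100 * i)) for i, g in enumerate(terms))
    moment = math.sqrt(s * s + c * c)
    # Sum of contributions + moment term + quadratic length dependence.
    return total + C0 * moment + C1 + C2 * l + C3 * l * l
```

With these placeholder numbers, a poly-Leu segment scores lower (more favorable) than the same segment with a central Lys, which is the qualitative behavior the additive model is meant to capture.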

TopPredΔG—a Simple Hydrophobicity Plot-Based Predictor.

As a first attempt to use Eq. 1 to predict membrane-protein topology, we developed a variant of the simple TopPred algorithm (14), TopPredΔG. Briefly, a sliding window of fixed length (l = 21 residues) is scanned across the protein sequence, and Eq. 1 is used to generate a curve of calculated ΔGapp values against sequence position, Fig. 1. Local minima in the curve represent candidate TM segments. First, all minima below a lower cutoff value (ΔGlow) are identified and marked as “certain” TM segments, whereas all minima above ΔGlow but below a second higher cutoff value (ΔGhigh) are marked as “putative” TM segments. Second, all possible topologies, including all certain TM segments and either including or excluding each of the putative TM segments, are generated, and the topology that best complies with the positive-inside rule (14) is chosen as the final prediction.

Fig. 1.
Topology predictions by TopPredΔG and SCAMPI. Each image contains, from top to bottom, the crystal structure of the protein chain; the ΔGapp profile, with the TopPredΔG parameters ΔGhigh and ΔGlow shown as dashed ...

The two free parameters in the model (ΔGlow and ΔGhigh) were optimized over a benchmark set of known transmembrane topologies (see below). To find the optimal values, a grid search was performed among values in the overlap region between the ΔGapp distributions for transmembrane and nonmembrane protein segments (13) (see Methods). The optimum is broad, and small variations in ΔGlow and ΔGhigh give rise to no or small variations in the number of correctly predicted topologies (data not shown).

Because it had previously been found that topology predictions can be improved by taking evolutionary information into account (5, 6, 16), we also developed a version of TopPredΔG that takes a multiple-sequence alignment as input and calculates ΔGapp contributions from a weighted average of the corresponding alignment-column distribution (see Methods).

SCAMPI—a Simple Model-Based Predictor.

As a comparison, we also designed a simple generic topology model (SCAMPI), similar to a hidden Markov model in the sense that states and state transitions are used to define an underlying grammar. The architecture of the model consists of only four different modules corresponding to the membrane compartment (M), inside (I) and outside (O) loops, and the part of inside loops closest to the membrane boundary (i). The rationale for the last module is to make it possible to model the observed overrepresentation of positively charged residues in short cytoplasmic loops, i.e., the positive-inside rule (14).

The SCAMPI model does not use transition probabilities, and a flat distribution is used for amino acid emission probabilities in the I and O modules. In the i module, a flat distribution is also used, except that the emission probability for the positively charged residues Lys and Arg is optimized. In the M module, Eq. 1 is used to predict the free energy of insertion, and this energy is then converted to the corresponding estimated insertion probability and used as the emission probability. A detailed description of SCAMPI is available in Methods.

As in TopPredΔG, there are only two parameters in SCAMPI that are optimized on the training data: the emission probability of Arg and Lys in the i module (pKR) and a constant (ΔGshift) that is subtracted from ΔGapp in the M module before this energy is converted to the corresponding insertion probability, reflecting the fact that a relatively large fraction of the TM helices in multispanning membrane proteins have ΔGapp values >0 kcal/mol and are thus not expected to insert efficiently by themselves (13). As before, we performed a grid search to find the optimal values of these two parameters. We also developed a version of SCAMPI that predicts topology from a multiple-sequence alignment rather than a single sequence (see Methods).


To compare TopPredΔG and SCAMPI with each other and with available state-of-the-art topology predictors, we collected one “high-resolution” set of 123 homology-reduced membrane-protein chains with known 3D structure and another nonoverlapping “low-resolution” set of 146 homology-reduced membrane-protein chains with experimentally known topology (see Methods). The parameters in TopPredΔG and SCAMPI were optimized on the low-resolution set and used for topology prediction in the high-resolution set, together with 10 other frequently used topology-prediction methods. The results are summarized in Table 1 [see also supporting information (SI) Table S1]. Remarkably, TopPredΔG and SCAMPI both perform on a par with the best statistics-based topology predictors in both single- and multiple-sequence mode. Compared with TopPred II (14), TopPredΔG makes fewer over- and underpredictions and also predicts fewer inverted topologies (see Table S1), indicating that the positive-inside rule is sufficient to determine overall orientation if the TM helices are correctly identified. Consistent with earlier studies (5, 6, 16), we find that the use of multiple-sequence alignments increases performance compared with single-sequence-based predictions.

Table 1.
Fraction of correctly predicted topologies for single- and multiple-sequence versions of the different prediction methods on the high-resolution benchmark set of 123 PDB chains

To determine how well Eq. 1 discriminates between membrane proteins and nonmembrane proteins, we searched each protein in the benchmark set and a collection of 1,060 cytoplasmic and secreted mammalian proteins (excluding signal peptides) for their most hydrophobic segment (lowest ΔGapp) and determined the cutoff that best separated the two distributions. Both the single- and multiple-sequence versions of Eq. 1 reached sensitivities and specificities of 95–97% in this test (Table 2), comparable with the results for Phobius (17) and the neural network-based prefilter used by MEMSAT3 (6).
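The cutoff selection in such a discrimination test can be sketched as a simple scan over candidate thresholds. The sketch below assumes that each protein has already been reduced to the ΔGapp of its most hydrophobic segment and that "best separated" means maximizing sensitivity plus specificity; the text does not state the exact criterion, so that choice, like the function name `best_cutoff`, is an assumption.

```python
def best_cutoff(membrane_scores, globular_scores):
    """Pick the cutoff that best separates two score distributions.

    A protein is called 'membrane' if its lowest segment score is <= cutoff.
    Criterion (an assumption): maximize sensitivity + specificity.
    Returns (cutoff, sensitivity, specificity).
    """
    best = None
    # Only observed score values need to be tried as cutoffs.
    for cut in sorted(set(membrane_scores) | set(globular_scores)):
        true_pos = sum(s <= cut for s in membrane_scores)
        true_neg = sum(s > cut for s in globular_scores)
        sens = true_pos / len(membrane_scores)
        spec = true_neg / len(globular_scores)
        if best is None or sens + spec > best[1] + best[2]:
            best = (cut, sens, spec)
    return best
```

Ties are resolved in favor of the lowest cutoff, because a later candidate replaces the current best only if it is strictly better.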

Table 2.
Discrimination between membrane proteins and globular proteins


To test the extent to which our current understanding of the physicochemical basis of membrane-protein assembly helps predict membrane-protein topology, we have developed two simple prediction methods based directly on experimental measurements of TM helix recognition in vivo. The predictors were built around a recently developed position-specific membrane-insertion propensity scale (13) while minimizing the number of free parameters needed to be determined from sequence statistics. Both predictors perform on a par with the current best topology prediction methods in our benchmark on high-resolution membrane protein structures.

In total, 28 of 454 helices (6%) in the benchmark set are underpredicted by the single-sequence version of SCAMPI, and 13 helices are overpredicted. By using multiple-sequence information, these numbers are reduced to 8 (2%) underpredictions and 10 overpredictions. Generally, the “missed helices” (Table S2) have both higher ΔGapp values (average ΔGapp = 4.4 kcal/mol vs. 0.76 kcal/mol) and a higher fraction of surface area in contact with the surrounding protein (67% buried vs. 54% buried surface area) than found for the complete dataset. In fact, there is a strong tendency for highly exposed helices to have lower ΔGapp values (Fig. 2A), indicating that such helices need to be able to insert efficiently by themselves, in the absence of stabilizing interactions with surrounding protein. As is clear from Fig. 2A, a good part of the surface of the high-ΔGapp helices is buried already within the same polypeptide chain. However, the mean ΔGapp for the most exposed group of helices (0–20% buried) is considerably higher when considering area buried against the chain than against the whole protein complex, indicating that a number of helices with relatively high ΔGapp are efficiently buried (>20%) only upon oligomerization.

Fig. 2.
ΔGapp correlates with TM helix environment. (A) Correlation between ΔGapp and TM helix surface accessibility. White bars, fraction of surface area in contact with the whole protein; black bars, fraction of surface area in contact with ...

Along the same line of thought, there should be more opportunities for helix–helix interactions in proteins containing many TM helices, and such helices might thus be expected to be more polar on average. Indeed, the mean ΔGapp increases with the number of TM helices in the protein (Fig. 2B). Among the overpredicted helices (Table S3), more than half are reentrant regions (18), i.e., they partly penetrate the membrane but enter and exit from the same side.

The fact that many TM helices in multispanning (but not single-spanning) membrane proteins have ΔGapp > 0 kcal/mol (13) implies that the “ultimate” free energy-based topology predictor must be able to model helix–helix interaction energies, i.e., it must incorporate important elements of the 3D structure. Further, marginally stable helices and reentrant loops may not always be recognized by the translocon but may be inserted into the membrane at later stages of the folding process (19). In TopPredΔG and SCAMPI, this remaining complication is addressed by the introduction of the free parameters ΔGhigh, ΔGlow, and ΔGshift, whereas the contributions from charged amino acids flanking the TM helices are included in the ranking of possible topology models (TopPredΔG) or by the free parameter pKR (SCAMPI).

In summary, we find that Eq. 1—a simple expression derived from experimentally measured free energies of membrane insertion—together with the positive-inside rule can be used quite successfully to predict membrane-protein topology with minimal use of additional parameter optimization. SCAMPI is freely available as part of a consensus topology prediction web server at http://topcons.cbr.su.se/.



A sliding window of fixed length (l = 21) is scanned across the sequence, and Eq. 1 is used to generate ΔGapp values for each position in the sequence (numbered according to the position of the central residue in the window). Starting with the position corresponding to the lowest ΔGapp value in the sequence, the corresponding 21-residue window is masked from the sequence, the next lowest ΔGapp value is found, and so on until no unmasked 21-residue stretch remains with a ΔGapp value below an upper cutoff value (ΔGhigh). All masked segments with ΔGapp value below a lower cutoff value (ΔGlow) are marked as certain TM segments, whereas the remaining n segments are marked as putative TM segments.
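The greedy masking procedure described above can be sketched in Python as follows. The scoring function is passed in as `score` and stands in for Eq. 1; the function name `find_tm_candidates` and its return format are illustrative choices, not part of the published method.

```python
def find_tm_candidates(seq, score, dg_low, dg_high, window=21):
    """Greedy selection of candidate TM segments.

    `score(segment)` is any function returning a predicted free energy
    (lower = more likely to insert); here it stands in for Eq. 1.
    Returns (certain, putative): sorted lists of window start indices.
    """
    n = len(seq)
    # Score every full-length window once.
    scores = {s: score(seq[s:s + window]) for s in range(n - window + 1)}
    masked = [False] * n
    certain, putative = [], []
    while True:
        # Windows that do not overlap any previously masked residue.
        free = [s for s in scores if not any(masked[s:s + window])]
        if not free:
            break
        best = min(free, key=lambda s: scores[s])
        if scores[best] >= dg_high:
            break  # no unmasked window remains below the upper cutoff
        for k in range(best, best + window):
            masked[k] = True
        # Below the lower cutoff: certain; otherwise putative.
        (certain if scores[best] < dg_low else putative).append(best)
    return sorted(certain), sorted(putative)
```

Recomputing the list of free windows on every pass is quadratic in sequence length, which is acceptable for a sketch on typical protein sequences but would be worth optimizing for genome-scale scans.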

The 2^n possible topologies, including all certain TM segments and either including or excluding each of the n putative TM segments, are then generated, and the topology with the largest (K + R) bias, defined as the number of Arg or Lys residues in all inside loops within 12 residues from the membrane boundary minus the corresponding number for all outside loops, is chosen. If the largest (K + R) bias is shared by more than one topology, the one with fewer TM helices is chosen (there are nine such cases in the benchmark set, three of which are correctly predicted by this rule).
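A minimal sketch of this enumeration is given below. It assumes that TM segments are given as window start indices, that both orientations (N terminus inside or outside) are scored for every helix combination, and that terminal loops touch the membrane at one end only; the function names are hypothetical.

```python
from itertools import product

def kr_near_membrane(seq, start, end, left_tm, right_tm, flank=12):
    """Count Lys/Arg in loop seq[start:end] lying within `flank` residues
    of a membrane boundary (left and/or right end of the loop)."""
    idx = set()
    if left_tm:    # loop begins right after a TM segment
        idx.update(range(start, min(end, start + flank)))
    if right_tm:   # loop ends right before a TM segment
        idx.update(range(max(start, end - flank), end))
    return sum(seq[i] in "KR" for i in idx)

def best_topology(seq, certain, putative, window=21):
    """Return (n_term_inside, tm_starts) with the largest (K + R) bias,
    or None if no TM segment is available."""
    best = None
    # Each putative segment is either excluded or included: 2^n combinations.
    for mask in product([False, True], repeat=len(putative)):
        tms = sorted(certain + [p for p, keep in zip(putative, mask) if keep])
        if not tms:
            continue
        bounds = [0] + [x for s in tms for x in (s, s + window)] + [len(seq)]
        loops = [(bounds[2 * k], bounds[2 * k + 1]) for k in range(len(tms) + 1)]
        counts = [kr_near_membrane(seq, a, b, k > 0, k < len(tms))
                  for k, (a, b) in enumerate(loops)]
        even, odd = sum(counts[0::2]), sum(counts[1::2])
        # Loops alternate sides, so even-numbered loops share the N-terminal side.
        for n_in, bias in ((True, even - odd), (False, odd - even)):
            key = (bias, -len(tms))  # larger bias first, then fewer helices
            if best is None or key > best[0]:
                best = (key, n_in, tms)
    return None if best is None else (best[1], best[2])
```

Exhaustive enumeration is practical here because n, the number of putative segments, is small for typical proteins; the cost doubles with every additional putative segment.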

When sequence profiles are used as input, Eq. 1 is modified by replacing ΔGappaa(i) with

$$\sum_{aa} f(aa(i)) \, \Delta G_{app}^{aa(i)}$$

where the sum goes over all amino acids and f(aa(i)) is the frequency of amino acid aa at position i. For calculating the (K + R) bias, the number of Arg or Lys residues is replaced by the sum of the relative frequencies of Lys + Arg residues in the columns within 12 residues of the membrane boundary.
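The profile-based substitution amounts to a frequency-weighted average over each alignment column, which can be sketched as below. The column is represented as a dictionary of relative frequencies, and `contribution(aa, rel_pos)` is a hypothetical stand-in for the position-specific ΔGapp profiles of ref. 13.

```python
def profile_contribution(column_freqs, contribution, rel_pos):
    """Frequency-weighted contribution of one alignment column.

    column_freqs: dict mapping amino acid letter -> relative frequency.
    contribution: function (aa, rel_pos) -> kcal/mol (hypothetical stand-in
                  for the experimental position-specific profiles).
    """
    return sum(f * contribution(aa, rel_pos) for aa, f in column_freqs.items())

def kr_weight(column_freqs):
    """Relative frequency of Lys + Arg in a column, used in place of a
    residue count when computing the (K + R) bias from a profile."""
    return column_freqs.get("K", 0.0) + column_freqs.get("R", 0.0)
```

For a column containing a single residue type with frequency 1, both functions reduce to the single-sequence case.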

A Scale-Based Method for Prediction of Integral Membrane Proteins (SCAMPI).

The topology model in SCAMPI consists of four modules, interconnected as shown in Fig. S1:

  • Outside loop (O). Consists of a single state with a flat emission-probability distribution, paa = 0.05 for all amino acids.
  • Inside loop (I). Consists of a single state with a flat emission-probability distribution, paa = 0.05 for all amino acids.
  • Part of inside loops closest to the membrane boundary (i). Consists of 24 states. The emission probability for Lys and Arg pR = pK = pKR is a parameter optimized on the training data (see below), whereas paa = 0.05 for all other amino acids.
  • Membrane (M). Two identical modules consisting of 21 states each. In the first 20 states, a flat emission-probability distribution is used, i.e., paa = 0.05 for all amino acids. In the last state (marked in Fig. S1), Eq. 1 is used to predict the free energy of membrane insertion of the whole TM segment starting 20 residues before the current sequence position. This can be done without losing the Markov property, since there is only one state path leading from the first TM state to the last. The emission probability is then calculated as:
    $$p_{M} = \frac{1}{1 + e^{(\Delta G_{app} - \Delta G_{shift})/RT}}$$
    where ΔGapp is calculated from Eq. 1, ΔGshift is a parameter optimized on the training data (see below), R is the gas constant, and T is the temperature (300 K).
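The conversion from a shifted free energy to an emission probability in the M module can be sketched as follows, assuming a two-state Boltzmann relation between apparent free energy and insertion probability. The function name and the default value of `dg_shift` are illustrative; the optimized ΔGshift is listed in Table 3.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)

def insertion_probability(dg_app, dg_shift=0.0, temperature=300.0):
    """Two-state estimate of the insertion probability of a TM segment.

    dg_app and dg_shift are in kcal/mol; dg_shift is subtracted from
    dg_app before the conversion. The default of 0.0 is a placeholder.
    """
    x = (dg_app - dg_shift) / (R_KCAL * temperature)
    # Numerically stable logistic function, safe for large |x|.
    if x >= 0:
        e = math.exp(-x)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(x))
```

A shifted free energy of zero gives a probability of exactly one half, and a positive `dg_shift` raises the probability assigned to marginally hydrophobic segments.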

When sequence profiles are used as input, Eq. 1 is modified as described for TopPredΔG above, and the parameter pKR is replaced by wKR, where the emission probability for the profile column is calculated as pcolumn = 0.05 + wKR·f(Arg+Lys), and f(Arg+Lys) is the frequency of Arg and Lys residues in the profile column.

SCAMPI does not use transition probabilities. Instead, possible state transitions are either simply allowed or disallowed and thus do not contribute to the overall score of the topology other than providing a definition of the underlying grammar. The Viterbi algorithm (20) is used to find the highest-scoring state path through the model, which then corresponds to the predicted topology. The model construction and all calculations were made by using the modhmm package (www.modhmm.org/).


The high-resolution benchmark set consists of 123 membrane-protein chains from 73 Protein Data Bank (PDB) (21) structures. All structures from the OPM database (22) were homology-reduced at 40% sequence identity by using cd-hit (23), which resulted in 114 chains. Another nine chains from the MPtopo (24) database, which were not present in OPM and had <40% sequence identity both to each other and to the 114 previously selected sequences, were added. In addition, a low-resolution set of 146 membrane-protein chains was constructed by combining the datasets used by Viklund et al. (5) and Käll et al. (17), and homology reducing at 40% sequence identity internally and at 25% sequence identity against the high-resolution set. This way, no sequence in the training data had >25% sequence identity to any of the sequences in the test data. Sequence profiles were constructed by using one round of BLAST (25) with an E value cutoff of 10^−5; cutoff values ranging from 10^−3 to 10^−20 gave almost identical results (data not shown). Sequences, sequence profiles, and predicted topologies for both the high- and low-resolution datasets are available at http://topcons.cbr.su.se/.

SCAMPI and TopPredΔG were trained on the low-resolution set and tested on both the high-resolution set (Table 1 and Table S1) and the low-resolution set (Table S1). Because many of the methods, including SCAMPI and TopPredΔG, have been trained on all or a large part of the low-resolution set, the results from the low-resolution test should be interpreted with caution.

By using the high-resolution benchmark set, SCAMPI and TopPredΔG were compared against 10 other frequently used topology prediction methods: HMMTOP (4), MEMSAT (26), MEMSAT3 (6), Phobius (17), PolyPhobius (16), PRO-TMHMM (5), PRODIV-TMHMM (5), TMHMM2.0 (3), and TopPred II (14).

In the discrimination test (Table 2), the high-resolution benchmark set was used, together with 566 secreted and 494 cytoplasmic mammalian proteins from SwissProt (27) version 54.3, homology-reduced at 40% sequence identity by using cd-hit (23).

Parameter Optimization and Cross-Validation.

Because only two parameters need to be optimized for each method, it was computationally possible to perform a grid search for each parameter. The optimal values, ranges searched, and step sizes for each parameter are given in Table 3. If more than one set of parameter values gave rise to the same number of correct topologies, the one with better balance between under- and overpredictions was used. As a measure of the performance of the methods in Table 1, the percentage of correctly predicted topologies as defined by Krogh et al. (3) was used.
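A two-parameter grid search of this kind can be sketched in a few lines. In the sketch below, `predict(seq, a, b)` is a hypothetical predictor returning a topology that can be compared for equality with the known one, and ties are resolved by keeping the first parameter pair found rather than by the balance between under- and overpredictions described above.

```python
from itertools import product

def grid_search(predict, proteins, grid_a, grid_b):
    """Exhaustive search over two parameters.

    predict:  function (seq, a, b) -> predicted topology (hypothetical).
    proteins: list of (seq, true_topology) pairs used for training.
    Returns (n_correct, best_a, best_b).
    Simplification: on ties the first parameter pair encountered is kept.
    """
    best = None
    for a, b in product(grid_a, grid_b):
        correct = sum(predict(seq, a, b) == truth for seq, truth in proteins)
        if best is None or correct > best[0]:
            best = (correct, a, b)
    return best
```

The cost is the product of the two grid sizes times the training-set size, which stays small when only two parameters are optimized.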

Table 3.
Model parameters

Surface-Accessibility Calculations.

Surface-accessibility calculations were performed by using NACCESS (www.bioinf.manchester.ac.uk/naccess/). The fraction buried surface area was calculated as:

$$\text{fraction buried} = 1 - \frac{area_{protein}}{area_{isolated}}$$

where areaprotein is the accessible surface area of the helix in the presence of the molecule or chain, and areaisolated is the accessible surface area of the TM helix in isolation.
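As a worked illustration with invented numbers, a helix with an accessible surface area of 3,000 Å² in isolation and 1,200 Å² within its chain has 60% of its surface buried. The small helper below performs the same calculation; it assumes that the two areas have already been obtained from a surface-accessibility program and does not call NACCESS itself.

```python
def fraction_buried(area_in_context, area_isolated):
    """Fraction of a helix's accessible surface buried by its surroundings.

    area_in_context: accessible area of the helix within the chain or complex.
    area_isolated:   accessible area of the same helix on its own.
    Both values must be in the same units (e.g., square angstroms).
    """
    if area_isolated <= 0:
        raise ValueError("area_isolated must be positive")
    return 1.0 - area_in_context / area_isolated
```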



This work was supported by grants from the Swedish Research Council, the Swedish Foundation for Strategic Research, and the EU 6th Framework Program (Embrace, Contract LSHG-CT-2004-512092 and Biosapiens, Contract LSHG-CT-2004-512092).


The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/cgi/content/full/0711151105/DCSupplemental.


1. Elofsson A, von Heijne G. Membrane protein structure: Prediction versus reality. Annu Rev Biochem. 2007;76:125–140.
2. Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. J Mol Biol. 1982;157:105–132.
3. Krogh A, Larsson B, von Heijne G, Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. J Mol Biol. 2001;305:567–580.
4. Tusnady GE, Simon I. The HMMTOP transmembrane topology prediction server. Bioinformatics. 2001;17:849–850.
5. Viklund H, Elofsson A. Best alpha-helical transmembrane protein topology predictions are achieved using hidden Markov models and evolutionary information. Protein Sci. 2004;13:1908–1917.
6. Jones DT. Improving the accuracy of transmembrane protein topology prediction using evolutionary information. Bioinformatics. 2007;23:538–544.
7. Rapoport TA, Goder V, Heinrich SU, Matlack KE. Membrane-protein integration and the role of the translocation channel. Trends Cell Biol. 2004;14:568–575.
8. Alder NN, Johnson AE. Cotranslational membrane protein biogenesis at the endoplasmic reticulum. J Biol Chem. 2004;279:22787–22790.
9. White SH. Membrane protein insertion: The biology–physics nexus. J Gen Physiol. 2007;129:363–369.
10. White SH, von Heijne G. Transmembrane helices before, during, and after insertion. Curr Opin Struct Biol. 2005;15:378–386.
11. White SH, Ladokhin AS, Jayasinghe S, Hristova K. How membranes shape protein structure. J Biol Chem. 2001;276:32395–32398.
12. White SH, Wimley WC. Membrane protein folding and stability: Physical principles. Annu Rev Biophys Biomol Struct. 1999;28:319–365.
13. Hessa T, et al. Molecular code for transmembrane-helix recognition by the Sec61 translocon. Nature. 2007;450:1026–1030.
14. von Heijne G. Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule. J Mol Biol. 1992;225:487–494.
15. Eisenberg D, Weiss RM, Terwilliger TC. The helical hydrophobic moment: A measure of the amphiphilicity of a helix. Nature. 1982;299:371–374.
16. Käll L, Krogh A, Sonnhammer EL. An HMM posterior decoder for sequence feature prediction that includes homology information. Bioinformatics. 2005;21(Suppl 1):i251–i257.
17. Käll L, Krogh A, Sonnhammer EL. A combined transmembrane topology and signal peptide prediction method. J Mol Biol. 2004;338:1027–1036.
18. Viklund H, Granseth E, Elofsson A. Structural classification and prediction of reentrant regions in alpha-helical transmembrane proteins: Application to complete genomes. J Mol Biol. 2006;361:591–603.
19. Buck TM, Wagner J, Grund S, Skach WR. A novel tripartite motif involved in aquaporin topogenesis, monomer folding and tetramerization. Nat Struct Mol Biol. 2007;14:762–769.
20. Viterbi AJ. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE T Inform Theory. 1967;13:260–269.
21. Berman HM, et al. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–242.
22. Lomize MA, Lomize AL, Pogozheva ID, Mosberg HI. OPM: Orientations of Proteins in Membranes database. Bioinformatics. 2006;22:623–625.
23. Li W, Godzik A. Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22:1658–1659.
24. Jayasinghe S, Hristova K, White SH. MPtopo: A database of membrane protein topology. Protein Sci. 2001;10:455–458.
25. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–410.
26. Jones DT, Taylor WR, Thornton JM. A model recognition approach to the prediction of all-helical membrane protein structure and topology. Biochemistry. 1994;33:3038–3049.
27. The Universal Protein Resource (UniProt). Nucleic Acids Res. 2007;35:D193–D197.
