
# Growth of novel protein structural data

Contributed by Michael Levitt, December 29, 2006

Author contributions: M.L. designed research, performed research, contributed new reagents/analytic tools, analyzed data, and wrote the paper.

Freely available online through the PNAS open access option.

## Abstract

Contrary to popular assumption, the rate of growth of structural data has slowed, and the Protein Data Bank (PDB) has not been growing exponentially since 1995. Reaching such a dramatic conclusion requires careful measurement of growth of novel structures, which can be achieved by clustering entry sequences, or by using a novel index to down-weight entries with a higher number of sequence neighbors. These measures agree, and growth rates are very similar for entire PDB files, clusters, and weighted chains. The overall sizes of Structural Classification of Proteins (SCOP) categories (number of families, superfamilies, and folds) appear to be directly proportional to the number of deposited PDB files. Using our weighted chain count, which is most correlated to the change in the size of each SCOP category in any time period, shows that the rate of increase of SCOP categories is actually slowing down. This enables the final size of each of these SCOP categories to be predicted without examining or comparing protein structures. In the last 3 years, structures solved by structural genomics (SG) initiatives, especially the United States National Institutes of Health Protein Structure Initiative, have begun to redress the slowing growth of the PDB. Structures solved by SG are 3.8 times less sequence-redundant than typical PDB structures. Since mid-2004, SG programs have contributed half the novel structures measured by weighted chain counts. Our analysis does not rely on visual inspection of coordinate sets: it is done automatically, providing an accurate, up-to-date measure of the growth of novel protein structural data.

**Keywords:** number of folds, Protein Data Bank, structural genomics

The publication of the structure of the globular protein myoglobin by Kendrew (1) in 1960 electrified the scientific community by providing the first glimpse of the clockwork intricacy of nature's nanoscale structural machinery. For me, the beauty of this first structure, with its 2,600 atoms precisely arranged in three dimensions, was even more dramatic as a colored painting by Irving Geis that appeared on the December 1961 cover of Scientific American (2). In the 15 years from 1960 to 1974, 10 additional protein structures were solved and by 1976 it was possible to use the 31 known structures to define a classification scheme (3) still valid today (4). In the almost 50 years since these modest beginnings, the number of solved protein structures has increased steadily; today an average of 11 structure files are deposited into the Protein Data Bank (PDB) (5, 6) every day.

In the past, structural data have grown as protein structures were solved to answer key biological questions. The value of the structures outside their biological context was increasingly appreciated thanks to theoretical work that showed how a known protein structure can be used to model the structure of a protein with a closely related sequence (7, 8). This field of homology modeling is now a major preoccupation of human modelers and automatic modeling servers alike (9, 10). After Chothia (11) hypothesized that the number of different protein shapes was finite and perhaps as small as 1,000, it seemed feasible to determine structures of representative proteins and to then derive most other structures by homology modeling (12). This was the basis of the Protein Structure Initiative started by the United States National Institutes of Health in 1999 (13). Similar initiatives were started elsewhere, especially at Riken in Japan (14) and SPINE in Europe (15), to give a worldwide effort in a new field known as structural genomics or SG (the SPINE initiative does not aim to extend coverage of structure space).

Despite considerable investment in new methods for solving protein structures, little attention was given to the need to track the progress of these initiatives until the Chandonia and Brenner article at the beginning of 2006 (16). In that article, three different measures are used to assess the novelty of structures: (*i*) the number of structures that are sequence-unique at different levels of similarity as measured by sequence identity or other more sophisticated scores, (*ii*) the number of matches to a sequence in a PFAM family that had no previous members of known structure, and (*iii*) the number of new Structural Classification of Proteins (SCOP) folds. Use of multiple measures is problematic because the overall novelty score will depend on how the measures are weighted. Use of matches to SCOP depends on manual curation, which is generally not up-to-date (SCOP Version 1.67 used by Chandonia and Brenner did not include PDB files released after May 15, 2004).

Here, we present a robust and reliable statistical method that quantifies novel protein structural data accurately. It uses sequence alone and does not require that structures be examined or compared. Account is taken of the sequence identity of multiple chains solved in the same PDB entry, sequence similarity of chains in different entries and the different number of residues in each entry. A weighted count method is introduced to eliminate sequence redundancy by down-weighting the contribution of sequences with more sequence neighbors already in the PDB. This measure is numerically very similar to the number of clusters found by hierarchical clustering. Although clustering is most like conventional classification schemes, the weighted count is much easier to compute and also more robust. Using this measure on data deposited at different times, we find that the change in the number of weighted chains at a 25% sequence identity threshold is the best predictor of the corresponding number of entries in the hand-curated SCOP protein classification (17); this allows prediction of the category sizes in SCOP well before the actual numbers are available.

The PDB is not growing exponentially (18, 19). Instead, the annual growth rate over the past 33 years fluctuates between 6% and 150% with three periods of >30% annual growth (October 1972 to July 1976, October 1980 to April 1982, and October 1988 to April 1994) followed by a steady decrease in growth rate since 1997. Growth rates are essentially the same for PDB entries, nonidentical chains, sequence clusters, and weighted counts.

## Results

### Independent Measures of Novel Structure Agree.

In this work, two methods are used to estimate novel structural data: the cluster method, which clusters chains based on their level of sequence similarity to other chains; and the weighted count method, which reduces the weights of chains with more sequence neighbors. At a given threshold of sequence identity, each method uses the same set of links, but otherwise the methods are very different. Fig. 1 shows that when the sequence identity threshold (ID) is varied from 25% to 100%, the cluster and weighted count methods give very similar values for the effective number of chains. There is no reason to expect *N*_{C}^{ID} and *N*_{W}^{ID} to be numerically similar, and the agreement we see here gives us greater confidence to use *N*_{W}^{ID}, a more robust measure that is also much easier to calculate. In clustering, links are always symmetrical in that if A links to B then B links to A. Because of differences in chain lengths, some links will be different from A to B than from B to A. Asymmetric links were left out of the clustering to give *N*_{C}^{ID}, which is much closer to *N*_{W}^{ID}.

*N*_{C}^{ID} (olive) and *N*_{Cas}^{ID} (including asymmetrical links; light green) drop with %ID. The weighted chain count, *N*_{W}^{ID} (magenta), has a very similar dependence on the %ID. All measures depend nonlinearly on the %ID but can be fitted **...**

The best clustering of protein chains is that done in the SCOP database (17), which considers three main categories: families, superfamilies, and folds. Clustering is based on similarity of sequence, structure, function, and evolutionary history; it is done manually, mainly by Alexei Murzin. Clustering done here by sequence identity at a threshold of 25% is much less sensitive than that used in SCOP. One expects that all chains in clusters formed by sequence similarity would be in the same SCOP family (and hence in the same superfamily and fold). At ID = 25%, 99.0% of the 4,903 nonsingleton clusters are pure, with all of the chains in a cluster belonging to the same SCOP family (for a chain containing more than one SCOP domain, any of the domains determines family membership).

### Growth of the PDB.

The rapid growth of protein structural data is clear from Fig. 2*a*. All measures of structure seem to agree on this log scale plot, and the curves are approximately straight lines, which would seem to imply exponential growth. The Dickerson equation (18), which predicted the number of PDB files to be (1/0.19) × exp(0.19(*t* − 1960)), where *t* is the year, fits the real data very well for the 10 years from 1996 to 2006 but does much less well for the period from 1978 to 1995. Close examination of the changes with time of the percentage growth rates (Fig. 2*b*), which are also very similar for all measures, shows that the growth rates change dramatically with time. Were the growth of structural data to be exponential, the percentage growth rate would be constant. Instead, it shows three peaks. The first peak, occurring between 1972 and 1976 and including an additional 27 PDB files, corresponds to the initial explosion of crystallography that took place after the first structures were solved in England in the decade 1960 to 1970. The second peak is most likely due to the greater ease with which structures could be solved thanks to the spread of Digital Equipment Company's VAX 780 virtual memory computers first introduced in 1978. The third peak, occurring between 1991 and 1997 and including over 250 PDB files, corresponds to the availability of intense beams of x-rays from synchrotrons coupled with the use of crystals cooled in liquid nitrogen. Header records of these PDB files show that the frequency of use of crystal cooling rose rapidly between 1995 and 1997, settling down to ≈45% of solved PDB files. Use of synchrotron radiation rose slowly between 1995 and 2000; it is used for about half the structures solved since then.
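The statement that exponential growth implies a constant percentage growth rate can be checked numerically. The sketch below evaluates the Dickerson equation quoted above and its fractional growth rate; the function names are illustrative and not part of the original analysis.

```python
import math


def dickerson_count(year: float) -> float:
    """Number of PDB files predicted by the Dickerson equation quoted in the text."""
    return (1.0 / 0.19) * math.exp(0.19 * (year - 1960.0))


def fractional_growth(count_fn, year: float, dt: float = 1e-3) -> float:
    """Fractional growth rate (1/N) dN/dt, estimated by a central difference."""
    return (count_fn(year + dt) - count_fn(year - dt)) / (2.0 * dt * count_fn(year))


if __name__ == "__main__":
    # For a pure exponential the growth rate is the same in every year.
    for year in (1980, 1995, 2006):
        rate = fractional_growth(dickerson_count, year)
        print(year, round(dickerson_count(year)), round(rate, 4))
```

Any departure of the observed growth rate from a flat line, such as the three peaks described above, is therefore a departure from exponential growth.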

(*a*) Growth of protein structural data deposited in the PDB since its inception in 1972. The average growth rate is 28.0% per year for PDB files (*N*_{PDB}; gray) and 28.4% for nonidentical chains (*N*_{CHA} **...**

In addition to these growth rate peaks, in the past decade there has been a steady drop in the rate of release of structural data, both total and novel. This growth rate is fitted well by the function *g*[*N*_{PDB}] = *a* + *b*/(*t* − *t*_{o}), where *a*, *b*, and *t*_{o} are constants determined by a least-squares fit to the observed *N*_{PDB} growth rate. Integration of this growth rate means that for the period from 1996 to 2006, the time dependence of the number of PDB files is a product of an exponential and a power law, *N*_{PDB}(*t*) = *c* × exp(*a*(*t* − *t*_{o})) × (*t* − *t*_{o})^{b}, or more explicitly, *N*_{PDB}(*t*) = 21.9 × exp(0.062(*t* − 1987)) × (*t* − 1987)^{2.17}. There is a small increase in the rate of growth in the year to July 2004 that is most pronounced for structures from SG. This could be caused by a burst of PDB files released at the end of phase I of the National Institutes of Health Protein Structure Initiative followed by a hiatus as phase II gets underway. The novelty ratio, Δ*N*_{W}^{25}/Δ*N*_{CHA}, which measures structural novelty of solved protein structures, is remarkably constant with a value of 0.18 ± 0.03 since 1992 (Fig. 2*a*).
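The link between the fitted growth rate and the integrated form follows from d ln *N*/d*t* = *a* + *b*/(*t* − *t*_{o}), which integrates to ln *N* = *a*(*t* − *t*_{o}) + *b* ln(*t* − *t*_{o}) + constant. A minimal numerical check of this relationship, using the constants quoted above, might look as follows; the function names are illustrative, and the expressions are only meaningful for years after 1987.

```python
import math

# Constants quoted in the text for the 1996-2006 fit.
A, B, T0, C = 0.062, 2.17, 1987.0, 21.9


def growth_rate(year: float) -> float:
    """Fitted fractional growth rate g = a + b / (t - t0)."""
    return A + B / (year - T0)


def n_pdb(year: float) -> float:
    """Integrated form: exponential times power law."""
    return C * math.exp(A * (year - T0)) * (year - T0) ** B


def numerical_growth_rate(year: float, dt: float = 1e-4) -> float:
    """(1/N) dN/dt of n_pdb, estimated by a central difference."""
    return (n_pdb(year + dt) - n_pdb(year - dt)) / (2.0 * dt * n_pdb(year))


if __name__ == "__main__":
    for year in (1996, 2001, 2006):
        print(year, round(growth_rate(year), 4), round(numerical_growth_rate(year), 4))
```

The two columns printed should agree, and the rate falls steadily over the decade, which is the slowing described in the text.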

### Predicting SCOP Growth.

The linear dependence of SCOP category sizes on the number of PDB files shown in Fig. 3*a* seems to offer a way to predict these sizes. As a simple rule-of-thumb, since 1997, ≈10.7% of deposited PDB files correspond to new families, 5.1% correspond to new superfamilies, and 2.9% correspond to new folds. Such prediction is important as manual curation is very labor intensive, and the lag time between SCOP classification and deposition of PDB files can be significant: since the last release on 1 October 2004 (1.69), the number of deposited files has increased by ≈30%. Caution is required in predicting the size of SCOP categories from the number of deposited files: given a new PDB chain, there is no way one can tell whether it will start a new SCOP family before Alexei Murzin examines its sequence and structure. The probability that the new chain will be a new family is, therefore, random at 0.07 (*N*_{W}^{25}/*N*_{CHA} = 0.07). Given a new chain with a weight of 1 (no sequence neighbors at the 25% ID), one expects a higher probability that it will be the first member of a new family. Thus, one expects the size of the categories to depend more on *N*_{W}^{25}, a measure of novel structural information.

(*a*) Sizes of the SCOP categories (families in blue, superfamilies in green, and folds in red) vary linearly (correlation >0.998) with the number of files released

**...**

Can one prove whether *N*_{W}^{25} or *N*_{PDB} is a better predictor of the size of a SCOP category? We use the correlated count and correlated weight methods (see *Methods*). Odds are calculated for sequence-unique chains predicting new families (*O*_{FAM:W25}) and also for new families predicting new folds (*O*_{FOL:FAM}). In both cases, the overall odds are significantly favorable at *O*_{FAM:W25} = 5.0 ± 0.7 and *O*_{FOL:FAM} = 10.5 ± 2.1. This means that when a new chain is sequence-unique, it is five times more likely to be the first member of a new SCOP family than expected by chance (for PDB chains, which predict families at random, *O*_{FAM:CHA} = 1). The probability that a sequence-unique chain starts a new family is five times larger than random, but the value is still small at 0.35 (5 × 0.07). The *O*_{FAM:W25} values fluctuate with time but the overall curve is flat, indicating that the power of *N*_{W}^{25} to predict *N*_{FAM} remains unchanged with time (SI Fig. 5). The results obtained with correlated weights also show the power of *N*_{W}^{25} to predict *N*_{FAM} in that the correlation coefficient between the weight of each added chain and whether it starts a new family is significant with an overall value of *C*_{FAM:W25} = 0.56 ± 0.07. Both the odds and correlation coefficients decrease as the sequence threshold used to calculate the sequence weight is increased: as the %ID is increased from 25 to 100, *O*_{FAM:W25} decreases from 5.0 to 1.8, whereas *C*_{FAM:W25} decreases from 0.57 to 0.28 (SI Fig. 6). This is expected because a method that detects remote homology better should be a better predictor of whether a sequence-unique chain starts a new family.

Knowing that *N*_{W}^{25} is the best predictor of SCOP category sizes, Fig. 3*b* plots the sizes of SCOP families, superfamilies, and folds against *N*_{W}^{25}. The curves are not linear and show saturation: as the amount of novel structural data grows, new categories are being formed more slowly. The sizes of the SCOP family, superfamily, and fold categories on August 20, 2006, the date of this analysis, are 3,757, 1,847, and 1,097. Use of saturating functions with more adjustable parameters had no effect on the fit or extrapolation.

### Contribution of SG.

Fig. 4*a* shows the growth in novel structural data since 2000. In the six and a half year period to August 2006, *N*_{W}^{25} has increased from 2,059 to 7,792, a factor of 3.8. If those structures that have been solved by the worldwide SG initiative are omitted, growth is from 2,044 to 5,899, a factor of 2.9. Without the data from SG, the growth in *N*_{W}^{25} is almost linear with time with 216 weighted chain counts added per year. The SCOP category sizes behave similarly (using the real data to October 2004 and predicted data based on the saturating functions in Fig. 3*a*, thereafter) with increases from 1,301 to 3,757, 817 to 1,847, and 537 to 1,097 for families, superfamilies, and folds, respectively (factors of 2.9, 2.3, and 2.0). The percentage of the annually deposited novel structural data measured by *N*_{W}^{25} that comes from SG has risen steadily and since the beginning of 2005 is 50% (Fig. 4*b*, % yearly Δ*N*_{W}^{25} Only SG).

## Discussion

### Measuring and Classifying Novel Structural Data.

The weighted count method introduced here performs extremely well as a measure of novel structural data. It involves no adjustable parameters other than a match threshold and has no disadvantages relative to the much more computationally expensive and more arbitrary chain clustering method. Extensive searching indicates that the connection we find between the weighted count and number of clusters has not been observed before. The close relationship seen here may be a special property of protein sequences, and we plan further tests with data sets having a wide variety of linking patterns.

An important property that distinguishes the weighted chain method from clustering is that it can be used on a subset of the data. Using the weights of a small random subset (just 2% of the sequences) gives estimates of *N*_{W}^{25} that are accurate to 5%. The time dependence of the estimated *N*_{W}^{25} is also accurate after year 1990, and better results can be obtained by averaging over five different random subsets. It would be interesting to calculate the value of *N*_{W}^{25} for all known sequences and use this to estimate the total number of SCOP families, superfamilies, and folds that would be found were all these sequenced proteins to have their structures solved. The current National Center for Biotechnology Information nonredundant database contains ≈3,000,000 sequences and a total number of residues that is ≈120 times larger than the corresponding database of unique chains from the PDB. Using FASTA on all these pairs would require 120^{2} times more computer time than the 200 days used by the PDB all-vs.-all comparison; this is close to 3,000,000 days or 8,000 years on an Intel Xeon 2.8-GHz processor. Using a random subset of 1,000 query sequences would require 1,000/3,000,000 = 0.03% of the sequence comparisons, take 60 days, and give an estimate of *N*_{W}^{25} accurate to a few percent.
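The subset estimate described above can be sketched in a few lines. The exact estimator is not spelled out in the text, so the version below is an assumption: the neighbors of each sampled chain are counted against the full database, and the mean weight of the sample is scaled up by the total number of chains. The function names and the synthetic neighbor counts are hypothetical.

```python
import random


def weighted_count(neighbor_counts):
    """Exact weighted count: sum of 1 / (n_neib + 1) over all chains."""
    return sum(1.0 / (n + 1) for n in neighbor_counts)


def estimate_weighted_count(neighbor_counts, fraction, rng):
    """Estimate the weighted count from a random subset of the chains.

    Assumption: each sampled chain's neighbor count was obtained by
    comparing it against the FULL set, so only the queries are subsampled.
    """
    size = max(1, round(fraction * len(neighbor_counts)))
    sample = rng.sample(neighbor_counts, size)
    mean_weight = sum(1.0 / (n + 1) for n in sample) / size
    return mean_weight * len(neighbor_counts)


def all_vs_all_days(size_ratio, reference_days=200.0):
    """All-vs-all time grows with the square of the database size ratio."""
    return size_ratio ** 2 * reference_days


if __name__ == "__main__":
    # Synthetic neighbor counts, for illustration only.
    counts = [i % 10 for i in range(10000)]
    rng = random.Random(0)
    print("exact:", weighted_count(counts))
    print("2% subset estimate:", estimate_weighted_count(counts, 0.02, rng))
    days = all_vs_all_days(120)
    print("all-vs-all days:", days, "years:", days / 365.25)
```

With a size ratio of 120 and the 200-day reference run, the all-vs-all cost comes to 2,880,000 days, which is the "close to 3,000,000 days" quoted above.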

A potential deficiency of our study is that we use the polypeptide chain in a PDB file as the basic unit. SCOP uses domains that are found by examination of the structure; using their definitions would make objective comparison with SCOP impossible and would also mean we had to wait for the SCOP domain definitions. We can parse chains into domains using sequence alignment. If disjoint regions of a long chain are sequence-matched to other chains that do not show any match to one another, then the long chain can be split into domains. Preliminary tests of automatic splitting with a 40% sequence identity threshold give a total of 51,765 chains versus the original number of 44,220 chains and have a reassuringly small effect on the results presented here. Correct parsing of chains into domains is difficult (20–22).

As noted in *Results*, the rate of growth of SCOP categories with deposited novel structural data as measured by *N*_{W}^{25} is slowing down (the ratio *N*_{FAM}/*N*_{W}^{25} is decreasing in Fig. 3*b*). If a similar plot is drawn for the most recent CATH 3.0.0 classification (23), the effect is more marked, probably because many of the most recently deposited PDB files are not being classified by CATH. It is not clear whether this is a property of cluster-based classification in general or of the SCOP classification in particular. In a preliminary test, we find *N*_{C}^{25}/*N*_{C}^{50} decreases with time but by <1/3 of the decrease in *N*_{FAM}/*N*_{W}^{25}; a proper test requires a more sensitive method of detecting chain similarity.

The present method of detecting similarity by using pairwise sequence alignment is not sensitive enough. It would be preferable to use more sensitive matching methods that use multiple sequences (like PSI-BLAST; ref. 24). One could also match known structures using structural alignment (25). Parsing the chains into structural domains could be done as described above for sequence matching and augmented with automated domain finding programs (26).

It would seem that fitting the growth of the SCOP categories to the weighted chain count, *N*_{W}^{25} (Fig. 3*b*), allows one to estimate the total number of SCOP folds as *N*_{W}^{25} becomes very large. This is a question that has received a great deal of attention since Chothia (11) suggested that there are <1,000 protein folds in all biology. Follow-up work (27) showed that with better assumptions, the number must be substantially larger. Since then, the estimated number of folds has varied from 650 (28) to >10,000 (29), with many other estimates in between these extremes (30–37). From the saturating function fit in Fig. 3*b*, we find that the maximum value of *N*_{FOL} is 1,613. Any extrapolation assumes that the selection of proteins for structure determination will continue as it was in the past, which is unlikely. In fact, the smooth dependence of the size of SCOP categories on the number of released PDB files (Fig. 3*a*) is surprising because the priorities for selecting proteins for structure determination have changed greatly over the past three decades.

### Estimating the Contribution of SG.

Protein SG centers began contributing structures to the PDB 10 years ago. Initial growth was very slow, and by the year 2000 only 36 out of 11,802 PDB files could be traced to SG (SI Table 1). Over the next six and a half years, an additional 3,134 SG files were deposited. More than half of this growth (1,918 PDB files) has occurred since October 1, 2004, the date of the most recent SCOP release (1.69). Clearly, one needs a better way to estimate the contribution of SG than waiting for SCOP. In their study, Chandonia and Brenner (16) estimated the contribution of SG by counting chains that are unique at a 30% identity threshold, by counting occurrences in SCOP 1.67 (dated 15 May 2004) and by counting the number of new PFAM families. They concluded that in the year 2005, SG centers had contributed about half the first structures in a protein family. In Fig. 4*b*, we show that the SG contribution to *N*_{W}^{25} over this same period is 50%. More importantly, we show that in the subsequent 18 months to August 20, 2006, this level remained steady. We estimate SG centers to have contributed half of the SCOP families, superfamilies, and folds in the two and a half years since January 1, 2004. It will be very interesting to see how these estimates compare with the real numbers that are likely to be available soon in the next release of SCOP. An earlier study by Todd *et al.* (38) used very time-consuming manual inspection of three-dimensional structures released and was only able to study proteins deposited into the PDB by July 31, 2003. From Fig. 4*b*, it is clear that by that date, the contribution of SG to novel structure was ≈25% or half its current value.

Is the increasing contribution of SG to novel protein structures causing a decrease in the novelty of non-SG structures? This is examined by looking at the ratio Δ*N*_{W}^{25}/Δ*N*_{CHA} for recently deposited non-SG structures (SI Table 1). From October 1, 2004 to August 22, 2006, *N*_{W}^{25} increased from 4,598 to 5,899, whereas *N*_{CHA} increased from 30,037 to 41,204 so that Δ*N*_{W}^{25}/Δ*N*_{CHA} = 0.117. For the period from January 1, 2000 to September 30, 2004, Δ*N*_{W}^{25}/Δ*N*_{CHA} = 0.140 so that non-SG structures are less novel than they used to be. The rate of growth of novel non-SG structure has been constant at 15% since 2000 but for the preceding 4 years it was higher at >20% (Fig. 2*b*). Thus, SG may have impacted negatively on conventional structure determination as measured by the novelty and the rate of growth of the structures released. It may also have allowed crystallographers to examine biologically interesting systems without regard to structural novelty. With a 3.8 times higher level of structural novelty and more rapid growth, SG has effectively doubled the rate of novel structure determination.
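The novelty ratio used in this comparison is a simple quotient of two increments. As a small worked check, the sketch below recomputes the value for the later period from the four counts given above; the function name is illustrative.

```python
def novelty_ratio(nw_start, nw_end, ncha_start, ncha_end):
    """Change in weighted chain count divided by change in nonidentical chains."""
    return (nw_end - nw_start) / (ncha_end - ncha_start)


if __name__ == "__main__":
    # Non-SG counts quoted in the text for October 2004 to August 2006.
    ratio = novelty_ratio(4598, 5899, 30037, 41204)
    print(round(ratio, 3))
```

The numerator is 1,301 weighted chains and the denominator 11,167 nonidentical chains, which reproduces the quoted 0.117.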

## Conclusions

The steady drop in the rate of determination of novel structures that began in 1995 was halted in 2003 in part due to structures solved by SG. Clearly, the very significant investment made worldwide in more efficient structure determination is essential if the growth in structural data is to continue at traditional rates. It is interesting that the protein structure initiative was first suggested in 1998 at a time when the fall in the rate of structure determination was already occurring but would not have been discernible.

## Methods

### PDB Data Sets Used.

The analysis presented here starts by downloading the lists of all PDB entries (ftp://ftp.rcsb.org/pub/pdb/derived_data/index/entries.idx), the sequences of all PDB chains (ftp://ftp.rcsb.org/pub/pdb/derived_data/pdb_seqres.txt), and the listing of PDB entries associated with SG (http://targetdb.pdb.org/target_files/targets.xml.gz) (TargetDB; ref. 39).

At the time data were downloaded (August 20, 2006), there were 35,805 released PDB files for proteins, 44,374 nonidentical protein chains (we eliminate multiple copies of the same chain in a PDB file), and 3,102 PDB files from SG. For details on PDB file selection and deposit and release dates, see *SI Text*.

### SCOP Data Sets Used.

This work used the SCOP 1.69 classification dated July 2005 (http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.cla.scop.txt_1.69). Note that although this file is dated July 2005, it only includes the 24,037 PDB files with a release date earlier than October 1, 2004. We also used http://scop.mrc-lmb.cam.ac.uk/scop/count.html to give the history of the SCOP classification based on 13 SCOP releases from October 20, 1997.

### Sequence Matching.

The degree of sequence similarity of pairs of PDB entries varies greatly: some pairs of PDB files have identical sequences, whereas others show no significant similarity. All chains that are highly similar in sequence or structure can be considered to contribute redundant structural information to the database. Here, we focus on sequence similarity which is easier to calculate and can also be applied more widely than structural comparison (25). Sequence similarity is measured by comparing all 44,374 nonidentical chain sequences with one another using Pearson's FASTA program (40). Only matches with *e*-values below 10^{−2} were kept.

Sequence matching of chains as described above has a serious problem in that FASTA is a local sequence alignment method. If sequence A consists of two parts, A1 and A2, where A1 is identical to sequence B and A2 is identical to sequence C, then FASTA will find that A is identical to B and also identical to C but could find that B and C are totally dissimilar. The correct way to solve this problem is to parse all of the sequences and split sequence A into two chains A1 and A2. Such parsing is not trivial (20–22), and here we deal with the problem in a different way.

If the FASTA percent identity of a match between chain A and chain B is *p*, the aligned region consists of *N* residues, and the length of chain A is *L*_{A}, then the effective percent identity is calculated as *p*_{E} = *p* × (*N*/*L*_{A}). Note that *p*_{E} < *p* unless the entire length of chain A is aligned. The effective percent identity is now different for the link from chain A to chain B versus that from chain B to chain A. With this scheme, partial matching is heavily down-weighted. Thus, in the above example, if the regions A1 and A2 have the same lengths (*L*_{1} = *L*_{2}), then for the match of A to B *p*_{E} will be 50% rather than 100%, whereas for the match of B to A it will be 100%.
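The effective percent identity is easy to reproduce. The following sketch implements the formula *p*_{E} = *p* × (*N*/*L*_{A}) and works through the two-part example; the function name and the specific chain lengths (200 and 100 residues) are chosen only for illustration.

```python
def effective_identity(percent_identity, aligned_length, query_length):
    """Effective percent identity p_E = p * (N / L_A) for the link A -> B.

    percent_identity: identity p reported over the aligned region
    aligned_length:   number of aligned residues N
    query_length:     full length L_A of the chain the link starts from
    """
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    return percent_identity * aligned_length / query_length


if __name__ == "__main__":
    # Chain A = A1 + A2 (100 residues each); chain B is identical to A1.
    a_to_b = effective_identity(100.0, aligned_length=100, query_length=200)
    b_to_a = effective_identity(100.0, aligned_length=100, query_length=100)
    print(a_to_b, b_to_a)
```

The link from A to B is halved because only half of A is covered, whereas the link from B to A keeps its full value because all of B is aligned.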

### Links Between Structures Based on Sequence Identity.

Chains A and B aligned by FASTA are considered to be linked if the *p*_{E} value for A to B is above the sequence identity threshold (%ID), which is varied from 25% to 100% in steps of 5%. Such links between protein chains can be used in two ways: (*i*) to join sequence-related chains into clusters (cluster method), and (*ii*) to weight each chain by its number of neighbors (weight method).

### Cluster Method.

In the cluster method, links are used to organize chains into clusters so that each cluster contains all of the chains that are linked to any member of the cluster (single linkage clustering). In this form of clustering, the distance between objects is either above or below the threshold: objects in the same cluster must be connected by a path of links and do not need to be linked directly. The number of clusters found at each percentage identity cutoff is denoted *N*_{C}^{ID}. This count must include the singleton clusters, the chains that make no links. In the cluster method, the links are effectively symmetrical in that it does not matter whether A is linked to B or B is linked to A. Here, the effective percent identity between A and B, *p*_{E}^{AB}, may be very different from that between B and A, *p*_{E}^{BA}. Links for which *p*_{E}^{AB} and *p*_{E}^{BA} differ by >20 percentage points are termed asymmetric and are omitted in most of the clustering runs.
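Single linkage clustering of this kind can be sketched with a union-find structure. The code below is not the program used in this work; it is a minimal illustration that assumes a link exists when the effective identity in either direction reaches the threshold, treats a missing reverse direction as 0%, and skips pairs whose two directions differ by more than 20 percentage points. The function name and input layout are hypothetical.

```python
def count_clusters(n_chains, effective_ids, threshold, max_asymmetry=20.0):
    """Count single-linkage clusters, including singletons.

    effective_ids: dict mapping (a, b) -> effective percent identity for a -> b
    threshold:     sequence identity threshold (%ID)
    max_asymmetry: pairs whose two directions differ by more than this many
                   percentage points are skipped; pass None to keep them
    """
    parent = list(range(n_chains))

    def find(x):
        # Follow parent pointers to the cluster root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), p_ab in effective_ids.items():
        p_ba = effective_ids.get((b, a), 0.0)  # assumption: missing means no match
        if max_asymmetry is not None and abs(p_ab - p_ba) > max_asymmetry:
            continue  # asymmetric link, omitted
        if max(p_ab, p_ba) >= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    return len({find(i) for i in range(n_chains)})


if __name__ == "__main__":
    ids = {(0, 1): 90.0, (1, 0): 88.0, (1, 2): 100.0, (2, 1): 50.0}
    print(count_clusters(4, ids, 25.0))                      # asymmetric links omitted
    print(count_clusters(4, ids, 25.0, max_asymmetry=None))  # asymmetric links kept
```

In the small example, chains 1 and 2 are joined only when asymmetric links are kept, so the cluster count drops from 3 to 2.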

### Weighted Count Method.

In the weighted count method, chains that are related to other chains are down-weighted because they contribute less to the total body of novel structural data. In this method, each chain is given a weight, and these weights are summed to get a weighted number of chains. The links found by sequence matching and used above for clustering are also used to calculate the weight of chain *i* with *W*_{i} = 1/(*n*_{neib} + 1), where *n*_{neib} is the number of neighboring chains linked to chain *i*. This scheme has properties that are intuitively sensible: singleton chains have no neighbors (*n*_{neib} = 0), so their weight is given by *W*_{s} = 1; chains that are part of a completely connected cluster of size *n* will each have a weight of 1/*n*, and the total weight of the cluster will be *W*_{cc} = *n*(1/*n*) = 1. Thus, adding identical copies of a particular object has no effect on the weight of that class of objects. Once the weight of every chain is defined, the weighted number of chains for the particular sequence ID is taken as the sum of the chain weights, *N*_{W}^{ID} = Σ *W*_{i}^{ID}. Because each object is assigned a weight, it is easy to calculate the total number of weighted residues as *M*_{W}^{ID} = Σ *W*_{i}^{ID} *m*_{i}, where *m*_{i} is the chain length. It is also easy to calculate the total number of chains or residues belonging to any particular subset of PDB files (e.g., deposited before a certain date, solved by SG, solved in a certain geographical location, etc.). In the weight method, the links are asymmetrical in that the links to chain A are used to determine its weight, while different links are used to determine the weight of chain B; thus, all links are used.
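The weighting scheme can be illustrated with a short sketch. The input format (a list of directed links between chain indices) and the function names are assumptions made for this example.

```python
def chain_weights(n_chains, links):
    """Weight of each chain: 1 / (number of linked neighbors + 1).

    links: iterable of directed pairs (a, b), meaning chain a links to chain b.
    Only links that start from a chain affect that chain's weight.
    """
    neighbors = [set() for _ in range(n_chains)]
    for a, b in links:
        if a != b:
            neighbors[a].add(b)
    return [1.0 / (len(nb) + 1) for nb in neighbors]


def weighted_chain_count(weights):
    """Weighted number of chains: the sum of the chain weights."""
    return sum(weights)


def weighted_residue_count(weights, lengths):
    """Weighted number of residues: the sum of weight times chain length."""
    return sum(w * m for w, m in zip(weights, lengths))


if __name__ == "__main__":
    # Chains 0-3 form a completely connected cluster; chain 4 is a singleton.
    links = [(a, b) for a in range(4) for b in range(4) if a != b]
    weights = chain_weights(5, links)
    print(weights)
    print(weighted_chain_count(weights))
    print(weighted_residue_count(weights, [100, 100, 100, 100, 50]))
```

The four fully connected chains each receive a weight of 1/4 and together count as one chain, so the example set of five chains has a weighted count of 2.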

### Measuring Growth of Structural Data.

Growth of structural data with time is found by considering what was in the PDB at two different times. This involves eliminating all structures deposited (or alternatively released) after a particular date to give a modified PDB. This set of structures is then analyzed by the cluster and weighted count methods using a sequence %ID that ranges from 25% to 100% to give the number of clusters, *N*_{C}^{ID}, and the weighted chain count, *N*_{W}^{ID}, at different dates. The number of deposited PDB files, *N*_{PDB}, and the number of nonidentical chains, *N*_{CHA}, are also recorded. To get the contribution of the subset of structures solved by SG, these PDB files are omitted and the analysis is repeated: the SG contribution is then the difference between the counts with and without the SG structures. The contributions of SG to SCOP are more problematic as a SCOP fold discovered first by SG and later discovered without it would not be recorded by the above omission method. Instead, we use all of the data and count the first member of each SCOP category for PDB entries from SG and not from SG.
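The date-cutoff and omission procedure can be sketched as follows. This is a simplified illustration and not the code used for the analysis: the data layout (a dictionary of chains with a date and an SG flag, plus a list of directed links) and the function names are hypothetical.

```python
from datetime import date


def weighted_count_at(chains, links, cutoff, include_sg=True):
    """Weighted chain count for the database as it stood on the cutoff date.

    chains: dict mapping chain id -> (date, is_sg)
    links:  iterable of directed pairs (a, b), meaning chain a links to chain b
    """
    kept = {
        cid
        for cid, (when, is_sg) in chains.items()
        if when <= cutoff and (include_sg or not is_sg)
    }
    neighbor_counts = {cid: 0 for cid in kept}
    for a, b in set(links):
        # Only links between chains present in the modified database count.
        if a != b and a in kept and b in kept:
            neighbor_counts[a] += 1
    return sum(1.0 / (n + 1) for n in neighbor_counts.values())


def sg_contribution(chains, links, cutoff):
    """SG contribution: count with SG structures minus count without them."""
    with_sg = weighted_count_at(chains, links, cutoff, include_sg=True)
    without_sg = weighted_count_at(chains, links, cutoff, include_sg=False)
    return with_sg - without_sg


if __name__ == "__main__":
    chains = {
        "a": (date(2000, 1, 1), False),
        "b": (date(2001, 1, 1), False),
        "c": (date(2005, 1, 1), True),
        "d": (date(2005, 6, 1), True),
    }
    links = [("a", "b"), ("b", "a"), ("a", "d"), ("d", "a")]
    print(weighted_count_at(chains, links, date(2002, 1, 1)))
    print(sg_contribution(chains, links, date(2006, 1, 1)))
```

Because weights depend on which neighbors are present, removing the SG chains also raises the weights of the non-SG chains linked to them, which is why the contribution is taken as a difference of two full recalculations.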

Growth rates of these numbers are calculated for each quarter and smoothed to give the annual growth rate (see legend to SI Fig. 7 for details). Growth rates are always expressed as the change in number divided by the current number (simple percentage growth).
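The snapshot and omission procedure can be sketched as follows. This is an illustration under stated assumptions, not the code used for the paper: entries are plain dictionaries with hypothetical keys `deposited` and `from_sg`, the counting step is passed in as `count_fn` (here simply `len`, standing in for *N*_{PDB}; a cluster or weighted count could be substituted), and "current number" is read as the count at the later of the two dates. The quarterly smoothing described in the SI is not reproduced.

```python
from datetime import date


def snapshot(entries, cutoff):
    """Entries deposited on or before the cutoff: the 'modified PDB'."""
    return [e for e in entries if e["deposited"] <= cutoff]


def sg_contribution(entries, cutoff, count_fn):
    """Count with all entries minus the count with SG entries omitted."""
    current = snapshot(entries, cutoff)
    without_sg = [e for e in current if not e["from_sg"]]
    return count_fn(current) - count_fn(without_sg)


def simple_growth(previous, current):
    """Change in number divided by the current number (assumed reading)."""
    return (current - previous) / current


# Hypothetical entries, for illustration only.
entries = [
    {"deposited": date(2003, 1, 1), "from_sg": False},
    {"deposited": date(2004, 5, 1), "from_sg": True},
    {"deposited": date(2005, 2, 1), "from_sg": False},
    {"deposited": date(2006, 7, 1), "from_sg": True},
]

n_2004 = len(snapshot(entries, date(2004, 12, 31)))
n_2005 = len(snapshot(entries, date(2005, 12, 31)))
sg_2005 = sg_contribution(entries, date(2005, 12, 31), len)
growth_2005 = simple_growth(n_2004, n_2005)
```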

### Correlation to SCOP Category Size.

Measuring the correlation between the change of the size of a particular SCOP category and an easily measured quantity like the change in the number of PDB files (Δ*N*_{PDB}), the number of unique chains (Δ*N*_{CHA}), or the weighted chain count (Δ*N*_{W}^{25}) is more difficult than one might expect. As the SCOP categories increase with increases in Δ*N*_{PDB}, Δ*N*_{CHA}, and Δ*N*_{W}^{25}, they all seem to be highly correlated.

Chains are ordered by date of release, and each chain is marked with four 0s or 1s depending on whether it is the first member of a new family, a new superfamily, or a new fold, or is sequence-unique (no sequence neighbors at the particular threshold so that its weight is 1.0). In doing this, all of the SCOP domains that are contained in a particular chain are used. We then take nonoverlapping sets of 500 structures sorted by increasing deposit date and calculate the number of times that a sequence-unique chain starts a new SCOP family, superfamily, or fold. This number is normalized by the frequency that would be expected by chance to give the odds that a sequence-unique chain starts a new SCOP category more often than expected by chance. Specifically, the odds that a sequence-unique chain will be the first member of a new SCOP family are *O*_{PDB:FAM} = 500 × *K*_{W25:FAM}/(*K*_{W25} × *K*_{FAM}), where *K*_{W25} is the number of sequence-unique chains at 25% ID, *K*_{FAM} is the number of first members of new SCOP families, and *K*_{W25:FAM} is the number of cases where sequence-unique chains are also members of a new family. This method is known here as the correlated count method.
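A minimal sketch of the correlated count calculation for one SCOP level is given below. It reads the odds as 500 × *K*_{W25:FAM}/(*K*_{W25} × *K*_{FAM}), that is, the observed number of coincidences divided by the number expected by chance in a set of 500 chains. The function names and the representation of each chain as a pair of booleans are assumptions made for illustration.

```python
def nonoverlapping_blocks(chains, size=500):
    """Split date-ordered chains into full blocks; a short tail is dropped."""
    return [chains[i:i + size] for i in range(0, len(chains) - size + 1, size)]


def odds_new_category(block):
    """Odds that a sequence-unique chain starts a new SCOP category.

    Each item in block is a pair (is_unique, is_first), where is_unique
    means weight 1.0 at 25% ID and is_first means first member of a new
    family (or superfamily, or fold).
    """
    k_unique = sum(1 for u, f in block if u)
    k_first = sum(1 for u, f in block if f)
    k_both = sum(1 for u, f in block if u and f)
    if k_unique == 0 or k_first == 0:
        return None  # odds are undefined for this block
    return len(block) * k_both / (k_unique * k_first)


# Hypothetical block of 500 chains: 50 unique, 20 first members, 10 both.
chains = (
    [(True, True)] * 10
    + [(True, False)] * 40
    + [(False, True)] * 10
    + [(False, False)] * 440
)
blocks = nonoverlapping_blocks(chains)
odds = odds_new_category(blocks[0])
```

In this made-up block, chance alone would give 50 × 20/500 = 2 coincidences, whereas 10 are observed, so the odds are 5.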

Because there is no way to tell whether a new PDB or new chain is going to be a member of a new family without examining its sequence or structure, the odds that a new PDB file or a new chain will start a new SCOP family are 1.0 [on average, *K*_{PDB:FAM} = (*K*_{W25} × *K*_{FAM})/500]. Thus, if the corresponding odds for sequence-unique chains are >1, *K*_{W25} will be a better predictor of the number of SCOP categories. Rather than consider the sequence-unique chains with weight of 1.0, it is possible to use the actual weight of each structure at the time it is added to the PDB. Correlation coefficients can then be calculated between the weight of each structure and a value that is 0 or 1 depending on whether the structure was the first member of a new SCOP family, superfamily, or fold. This method is known here as the correlated weight method.
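The correlated weight method can be sketched in the same spirit. The text does not state which correlation coefficient is used; the example below assumes the ordinary Pearson coefficient, and the weights and indicator values are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: the weight of each structure when it entered the PDB,
# and 1 if that structure was the first member of a new SCOP family.
weights = [1.0, 0.5, 0.25, 1.0]
is_first = [1, 0, 0, 1]

r = pearson(weights, is_first)
```

A coefficient near 1 would indicate that high-weight structures tend to be the ones that start new categories.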

## Acknowledgments

I thank numerous colleagues for critical discussion and encouragement. All four referees were exceptionally perceptive, and their constructive criticism greatly improved this work. This work was supported by National Institutes of Health Grant GM63817.

## Abbreviations

- PDB: Protein Data Bank
- SCOP: Structural Classification of Proteins
- SG: structural genomics
- ID: identity threshold

## Note.

SCOP 1.71 dated December 4, 2006, includes all PDB entries released by January 18, 2005. There are 1,626 additional PDB files, and 159, 50, and 26 additional SCOP families, superfamilies, and folds, respectively. From the dependence on *N*_{W}^{25} (Fig. 3*b*), we predict SCOP increases of 162, 62, and 33, respectively. Using the dependence on *N*_{PDB} would give less accurate SCOP predictions of 174, 83, and 47, respectively.
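The comparison in this note amounts to simple arithmetic, shown here as a short Python check that uses only the numbers quoted above; the dictionary keys and the helper name are illustrative.

```python
actual = {"families": 159, "superfamilies": 50, "folds": 26}
predicted_from_nw25 = {"families": 162, "superfamilies": 62, "folds": 33}
predicted_from_npdb = {"families": 174, "superfamilies": 83, "folds": 47}


def absolute_errors(predicted, observed):
    """Absolute difference between predicted and observed increases."""
    return {key: abs(predicted[key] - observed[key]) for key in observed}


errors_nw25 = absolute_errors(predicted_from_nw25, actual)
errors_npdb = absolute_errors(predicted_from_npdb, actual)
```

The weighted chain count overestimates the increases by 3, 12, and 7, whereas the PDB file count overestimates them by 15, 33, and 21.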

## Footnotes

The author declares no conflict of interest.

This article contains supporting information online at www.pnas.org/cgi/content/full/0611678104/DC1.


Growth of novel protein structural data. Proceedings of the National Academy of Sciences of the United States of America. Feb 27, 2007; 104(9):3183.
