Proc Natl Acad Sci U S A. Jul 17, 2007; 104(29): 12052–12056.
Published online Jul 9, 2007. doi:  10.1073/pnas.0701315104
PMCID: PMC1924584
Evolution

Temporal patterns of genes in scientific publications

Abstract

Publications in scientific journals contain a considerable fraction of our scientific knowledge. Analyzing data from publication databases helps us understand how this knowledge is obtained and how it changes over time. In this study, we present a mathematical model for the temporal dynamics of data on the scientific content of publications. Our data set consists of references to thousands of genes in the >15 million publications listed in PubMed. We show that the observed dynamics may result from a simple process: Researchers predominantly publish on genes that already appear in many publications. This might be a rewarding strategy for researchers, because there is a positive correlation between the frequency of a gene in scientific publications and the journal impact of the publications. By comparing the empirical data with model predictions, we are able to detect unusual publication patterns that often correspond to major achievements in the field. We identify interactions between yeast genes from PubMed and show that the frequency differences of genes in publications lead to a biased picture of the resulting interaction network.

Keywords: cultural information, life sciences, mathematical model, text mining

Cultural information, including scientific knowledge, changes over time. It has been argued that the dynamics of this change resemble an evolutionary process (1–7). Philosophers have developed “evolutionary epistemologies” to describe the dynamics of changes in scientific theories (2), whereas biologists such as Dawkins introduced the concept of “memes” to highlight the analogies in the evolution of biological and cultural information (3, 6, 7). Although the concept of “memes” has been criticized for being overly simplistic (8–10), and it is unclear how far-reaching the analogies between cultural and biological information are, an evolutionary framework for the dynamics of cultural information remains, in our view, highly plausible.

A quantitative analysis of empirical data is an essential step toward a more detailed understanding of the processes that change cultural information. For scientific knowledge, excellent time-resolved data are available from publication databases such as PubMed (www.ncbi.nlm.nih.gov/entrez). Several studies use these data mainly to analyze properties of citation and collaboration networks (11–17). Here, we present an analysis of the dynamics of content-related terms in the literature. We analyze data on the temporal patterns of thousands of genes in the titles and abstracts of scientific publications. This approach allows us to follow the dynamics of scientific progress in genetics and related research fields based on a large set of well defined and content-related terms.

To identify gene names, including synonyms and names of gene products in abstracts and titles of scientific publications listed in PubMed, we use the iHOP text-mining system (18, 19). PubMed is the major public database for publications in the life sciences. Depending on the species, the average recall of gene quotations in iHOP is ≈80%, and average precision is ≈95% (19). For some species such as yeast (Saccharomyces cerevisiae), these values reach 83% and 99%, respectively. A more detailed description has been given earlier and is also available on the iHOP web site (www.ihop-net.org/UniPub/iHOP/info/gene_index/index.html). For yeast, there are >4,000 genes that in total appear ≈80,000 times in ≈35,000 publications during the time span covered by PubMed (1975–2006). The total number of references to yeast genes in the literature shows an approximately exponential growth phase until the mid-1990s, followed by a phase of saturation (Fig. 1). The frequency distribution of genes in scientific publications resembles a power law (20). A few highly popular genes dominate the literature. Similar distributions have been observed in other types of bibliometric data sets, such as the frequency distribution of citations of scientific publications, where a few top papers attract most of the references (11–13).

Fig. 1.
Temporal patterns of publications on yeast genes. (A) Number of publications per year of the most popular gene (ACT1) and a typical gene (PET54). There is a high level of stochasticity in the temporal pattern, and there are large differences in the publication ...

Throughout this manuscript, we refer to the frequency of a gene in PubMed titles and abstracts as its popularity. The most popular yeast gene, ACT1 (coding for actin), appears in >1,400 publications, whereas there are ≈650 genes that appear only once and ≈2,000 genes that never appear in the title or abstract of a publication. These remarkable frequency differences may reflect differences in the importance of genes. In line with this reasoning, the most-studied human genes, such as CD4 and p53, are related to human diseases and thus are of high societal relevance. For yeast genes, however, it is less clear how societal relevance can be determined, and so far no relation has been observed between the popularity of a gene and its importance to cellular processes (20). This indicates that the emergence of highly popular genes is not necessarily driven by importance alone but also by other mechanisms in the praxis of scientific research, such as conventions and trends.

In the following, we develop a stochastic mathematical model to describe the dynamics of references to genes in the literature. To analyze whether the emergence of highly popular genes might in principle be driven by research trends rather than differences in the importance of genes, we assume that all genes of a species have the same properties in their publication dynamics. Two processes are assumed to contribute to the growth of literature on the genes of a species. First, publications on a specific gene generate additional publications on this gene. Second, publications on one gene generate publications on other genes. We assume linear kinetics for both processes. A third process is required to initiate growth. We assume that publications may appear “spontaneously” at a constant rate that does not depend on previous research. The expected number of novel publications $\langle \Delta P_{i,t+1} \rangle = \langle P_{i,t+1} - P_{i,t} \rangle$ on gene i at time point (year) t + 1 is given by:

\[ \langle \Delta P_{i,t+1} \rangle = k_1 P_t^* + k_2 P_{i,t} + k_3 \tag{1} \]

$P_{i,t}$ and $P_t^*$ denote the total number of publications on gene i and the average number of publications over all genes of a species, appearing until time point t. Parameters k1 and k2 describe the rates at which a publication on a gene gives rise to further publications on other genes and the same gene, respectively. Parameter k3 describes the rate of initial spontaneous publications. The parameters k1, k2, and k3 are assumed to be the same for all genes of a species. Thus a model with only three parameters is used to describe the dynamics of publications on several thousands of genes of a species. The model is related to “preferential attachment” and similar processes (21, 22). After initial linear growth at rate k3, the average number of publications per gene grows approximately exponentially at rate k1 + k2. Depending on parameters k1 and k2, the resulting frequency distribution is characterized by few ($k_2 \ll k_1$) or many ($k_1 \ll k_2$) highly popular genes.
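The growth rates quoted above can be checked by averaging the model over all genes. The following is a sketch that assumes Eq. 1 has the linear form $\langle \Delta P_{i,t+1} \rangle = k_1 P_t^* + k_2 P_{i,t} + k_3$ implied by the three processes described; N (the number of genes) is a symbol introduced here for illustration.

```latex
\begin{align*}
\langle \Delta P^*_{t+1} \rangle
  &= \frac{1}{N} \sum_{i=1}^{N} \langle \Delta P_{i,t+1} \rangle
   = \frac{1}{N} \sum_{i=1}^{N} \left( k_1 P_t^* + k_2 P_{i,t} + k_3 \right) \\
  &= k_1 P_t^* + k_2 P_t^* + k_3
   = (k_1 + k_2)\, P_t^* + k_3 .
\end{align*}
```

Under this assumption, the constant term dominates while $P_t^* \ll k_3/(k_1 + k_2)$, giving linear growth at rate k3. Once $P_t^* \gg k_3/(k_1 + k_2)$, the average grows by a factor of about $1 + k_1 + k_2$ per year, which is the approximately exponential phase.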

To describe the saturation phase of the last 10 years, we introduce the term $1/[1 + (P_t^*/P_S)^{\alpha}]$. The parameters $P_S$ and $\alpha$ determine after how many publications per gene and how abruptly saturation takes effect. The resulting dynamics is described by

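\[ \langle \Delta P_{i,t+1} \rangle = \left( k_1 P_t^* + k_2 P_{i,t} + k_3 \right) \frac{1}{1 + (P_t^*/P_S)^{\alpha}} \tag{2} \]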
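To make the two model equations concrete, the following minimal Python sketch computes the expected number of new publications for one gene. It assumes that Eq. 1 is the linear sum of the three processes and that the saturation term multiplies the whole growth rate in Eq. 2; the function and argument names are hypothetical, and the parameter values are the yeast estimates reported in Results.

```python
def expected_new_publications(p_gene, p_mean, k1, k2, k3, p_s=None, alpha=None):
    """Expected number of new publications on one gene in the next year.

    p_gene : cumulative publications on this gene so far (P_{i,t})
    p_mean : average cumulative publications per gene (P_t^*)
    k1, k2, k3 : cross-gene, same-gene, and spontaneous rates
    p_s, alpha : saturation parameters; if either is None, no saturation
                 is applied (assumed form of Eq. 1)
    """
    growth = k1 * p_mean + k2 * p_gene + k3
    if p_s is None or alpha is None:
        return growth
    # Assumed form of Eq. 2: the saturation term scales the whole rate.
    return growth / (1.0 + (p_mean / p_s) ** alpha)


# Illustrative values: a gene with 10 publications when the average is 2.
rates = dict(k1=0.028, k2=0.2, k3=0.005)
unsaturated = expected_new_publications(10, 2.0, **rates)
saturated = expected_new_publications(10, 2.0, p_s=8.2, alpha=1.2, **rates)
print(unsaturated, saturated)
```

In this form, the saturation factor equals 1/2 exactly when the average number of publications per gene reaches $P_S$, regardless of $\alpha$; $\alpha$ only controls how sharply the factor falls around that point.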

Results

We use maximum-likelihood estimation to determine the parameters k1, k2, k3, PS, and α from the data and perform a bootstrap analysis to determine 96% confidence intervals. Details are given in Materials and Methods. The estimated rates for exponential growth are k1 = 0.028 (0.024 < k1 < 0.034), k2 = 0.2 (0.18 < k2 < 0.22), and k3 = 0.005 (0.003 < k3 < 0.007). Saturation is described by PS = 8.2 (6.4 < PS < 9.8) and α = 1.2 (1.0 < α < 1.6). The original data and a sample set of simulated data (see Materials and Methods) using the estimated parameters are highly similar (Fig. 1). An initial growth of k3 = 0.005 indicates that ≈30 genes per year appear spontaneously in the literature. Before saturation takes effect, the average number of gene names in titles and abstracts of publications grows exponentially at a rate of k1 + k2 ≈ 0.23. Most of this growth (≈87%) is driven by the mechanism that publications on a specific gene promote further publications on the same gene. Only a small proportion of growth (≈13%) is driven by the alternative mechanism that publications on genes promote further publications on different genes. The estimated rates k1, k2, and k3 are similar to those estimated for the model of exponential growth (Eq. 1) for 1975–1994 [see supporting information (SI) Fig. 4]. The model captures not only the overall growth of publications on yeast genes but also the evolution of the frequency distribution (SI Fig. 4C). Thus for a time range of 20 years, a very simple model with only three parameters gives an excellent description for the temporal patterns of >6,000 gene names in the literature.
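The derived quantities in this paragraph follow from the point estimates by simple arithmetic. The sketch below assumes the total of 6,200 yeast genes stated in Materials and Methods (written here as N) and uses the rounded estimates as printed.

```latex
\begin{align*}
k_3 N &= 0.005 \times 6200 = 31 \approx 30 \ \text{spontaneous appearances per year}, \\
k_1 + k_2 &= 0.028 + 0.2 = 0.228 \approx 0.23, \\
\frac{k_2}{k_1 + k_2} &= \frac{0.2}{0.228} \approx 0.88, \qquad
\frac{k_1}{k_1 + k_2} = \frac{0.028}{0.228} \approx 0.12 .
\end{align*}
```

With the rounded estimates the shares come out at about 88% and 12%, close to the reported ≈87% and ≈13%; the small difference presumably reflects rounding of the published parameter values.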

Our results indicate that a frequency distribution with highly popular genes may, in principle, emerge, even if there are no differences in the importance of genes. However, they do not imply there are no differences in importance. Models that account for differences in importance may describe the dynamics similarly well. In practice, it is often impossible to determine whether importance plays a role. In our model, for example, the popularity of a gene depends on the time of its first appearance in the literature. The earlier a gene appears, the higher the expected frequency at a future point in time. However, it is difficult to determine whether the order at which genes appear in the literature is random, as in our model, or whether it is driven by importance. Furthermore, besides importance, there might be other factors that favor publications on specific genes. Some genes may be more easily accessible for scientific studies for methodological or experimental reasons. Particularly in early studies, some genes might be favored, because they are consistently expressed at a high level, or because mutations result in a distinctive phenotype. Prior knowledge, for example from studies in other species, may additionally increase the attractiveness of some genes for scientific research. Furthermore, even if trends and conventions play a role in the dynamics of science, this is not necessarily negative. It might be convenient and scientifically justified to always use the same genes or gene products as controls in assays or to use specific selective markers.

The strength of our model is its simplicity. Although scientific research is a highly complex process, our results show that a very simple model can predict frequency patterns of content-related terms, such as genes, in scientific publications. Our model does not rely on quantities such as importance, which are difficult or impossible to quantify in an objective way. Irrespective of the role of importance, our results indicate that the temporal evolution of publications on yeast genes follows a very simple dynamics: New publications are about genes that have been studied frequently in the past. Researchers predominantly publish on popular genes, although they may not necessarily be aware of this.

There is a highly significant positive correlation between the frequency at which a gene appears in the literature and the current impact of the journals in which it appears (Kendall's τ = 0.13, $P < 2.2 \times 10^{-16}$, n = 4,095). In other words, publications on popular genes appear in journals with higher impact (Fig. 2A). However, the temporal patterns of impact (Fig. 2B) reveal that, as the field progresses toward saturation, the reward for publishing on popular yeast genes decreases, whereas the reward for publishing on genes that have rarely or never been studied increases. By the end of the 1990s, the impact difference between publications on popular and unpopular genes disappeared. A potential explanation for this finding is that publications on popular genes face increasing competition for high-impact journals, whereas unstudied genes receive increasing interest. The positive correlation between popularity and journal impact might result from two different effects: If the first publications on a gene appear in a high-impact journal, it may become more attractive and thus more popular in the future. On the other hand, current popularity may have an influence on the impact of future publications. The latter mechanism would imply that at least until the mid-1990s, publishing on popular genes was a rewarding strategy for researchers.

Fig. 2.
Impact vs. publication frequencies. (A) There is a positive correlation between the frequency at which a gene appears in the literature and the average impact of the journals in which articles on a gene are published. More popular genes are published ...

To study whether our results described above are specific to S. cerevisiae, or whether they also apply to the publication dynamics of genes of other species, we performed a similar analysis for Drosophila melanogaster, Caenorhabditis elegans, and Homo sapiens. Results are shown in SI Fig. 5. For all species, we observe that k2 is much larger than k1, indicating that the growth of the research fields is mainly driven by the growth of research on the popular genes. The publication dynamics of Drosophila and C. elegans genes is very similar to that of yeast genes. Human genes follow a different dynamics. Most importantly, for human genes, our model cannot fully recapture the frequencies of the most popular genes. The assumption that all genes are of equal importance appears to be particularly unjustified for human genes: As mentioned above, the most frequent human genes are disease-related and therefore of high societal relevance. For the most popular genes, such differences in importance translate into an even higher popularity than can be explained by preferential attachment alone. As for S. cerevisiae, we observe a positive correlation between journal impact and popularity for C. elegans and Drosophila genes. Again, human genes differ from genes of other species, in that there is a negative correlation between journal impact and popularity. Given that there are many more publications that contain human genes in their titles or abstracts than there are for other species, it seems plausible that competition for limited space in high-impact journals plays a much larger role.

The model presented above (Eq. 2) is based on the assumption that all genes are equivalent in terms of their publication dynamics. For S. cerevisiae, Drosophila, and C. elegans, this allows the generation of patterns similar to the observed data. However, as discussed above, this does not necessarily imply there are no differences in the importance of genes in these species. Given the good fit, we can use discrepancies between model and data to detect such differences. More specifically, we can test whether a gene appeared at a specific time point in significantly more publications than would be expected based on the model. The event with the most significant deviation from expectation is the appearance of ACT1 in seven publications in 1980. This unexpected burst of publications is related to the sequencing of the yeast actin gene. A list of additional significant publication events is given in SI Table 1. Details on the methods are given in Materials and Methods.

The iHOP text-mining system also allows us to identify interactions among genes described in titles or abstracts from the PubMed database (19). More specifically, iHOP recognizes sentences of the pattern “gene/protein–verb–gene/protein,” where the verb indicates a physical interaction, such as “bind” or “interact.” For yeast, the resulting interaction network contains ≈6,500 unique interactions. The connectivity distribution of this network is dominated by a few highly connected genes (Fig. 3A). This distribution mainly results from the differences in the popularity of genes; there is a highly significant correlation between the frequency of a gene in the literature and the number of unique interactions obtained from the literature (Kendall's τ = 0.6, $P < 2.2 \times 10^{-16}$, n = 4,171). Although identifying a large number of interaction partners requires that a gene be well studied, it is, in our view, surprising that over 3 orders of magnitude, the frequency at which a gene appears in the literature is a strong predictor of the number of unique interactions reported for this gene.

Fig. 3.
The impact of popularity on the gene interaction network as obtained from PubMed data. (A) The double-logarithmic plot of the connectivity distribution indicates the presence of “hubs” with a large number of interaction partners reported ...

There are three hypotheses to explain this observation. First, genes with many interaction partners tend to become more popular. Second, most genes have a large number of interaction partners, and the more research is done, the more interaction partners are identified. Third, the more popular a gene is, the more false positives are published. All three hypotheses have interesting consequences. The first hypothesis, in our view, is the least plausible; when a gene initially attracts attention in the research community, it is not known how many interaction partners it has. However, early identification of many interaction partners might make a gene more interesting for the research community and may lead to additional publications. The second hypothesis is more plausible. More research on a gene is expected to lead to the identification of more interaction partners. Given that the correlation between popularity and number of interaction partners holds over several orders of magnitude, the second hypothesis implies that for any gene, an arbitrarily high number of interaction partners can be identified, if enough research is performed. This is in contrast to the prevailing view that the number of interactions of relevance for the functioning of a biological system is not arbitrarily high. The third hypothesis implies that interactions of more popular genes tend to be less reliable, i.e., the fraction of false findings among published interactions increases with increasing popularity. This is in line with theoretical predictions on the reliability of published research as outlined in a recent controversial study by J. P. A. Ioannidis (23). In contrast to the first hypothesis, the second and third hypotheses imply that hubs in the literature are not necessarily hubs in the underlying biochemical networks. Thus, they question the common view that biochemical networks are “scale-free”, which implies the presence of hubs.
At present, we cannot distinguish among the three hypotheses. Given the importance of this point for the research community, this remains a highly interesting question for future studies.

Hubs have been observed in interaction networks derived from high-throughput methods (24–27). Although in contrast to the literature network, these networks are not affected by a “popularity bias,” they also do not give an unbiased picture of the underlying biochemical network. Factors such as expression level or function generate a bias on the number of observed interaction partners (28). Given that the overlap between different high-throughput methods is relatively small (28), and the correlations between the connectivity of a gene as obtained by one vs. another unrelated method are weak (29), it is questionable whether high-throughput methods give a reliable picture for the presence of hubs in the underlying biochemical network. Our findings allow correction of the literature network for the “popularity bias” and illustrate that, when studying statistical properties of networks, it is essential to correct for biases that may arise from the methods used to generate the network. Results from high-throughput mass spectrometry or tandem affinity purification, for example, should be corrected for expression levels. This is not always done; research appears to be influenced by the high popularity of “small-world” networks.

Discussion

It has been recognized that sociological processes such as trends and conventions play a role in the progression of science (1). Our approach of using a large-scale data set on the dynamics of content-related terms in the scientific literature is a first step to quantitatively describing the mechanisms involved in this process. We show that researchers predominantly publish on genes that already appear at high frequency in the literature. This process leads to a frequency distribution of genes in scientific publications that resembles a power law. It has been argued that a similar process contributes to the frequency distribution of citations of papers: researchers predominantly cite papers cited by other papers, partially because they search the literature recursively, and because they copy references from other papers (14, 16, 30). As for the genes in our analysis, the popularity of research papers is driven not only by importance but also by social processes.

Given that journal impact is often used to evaluate researchers, the positive correlation between the popularity of a gene and the journal impact of its publications may indicate that publishing on more popular genes is a rewarding strategy. However, there may also be strategic disadvantages associated with performing research on popular genes. The chance that competitors perform research on the same question can be expected to be greater for popular than for unpopular genes. Competition for the limited space in high-impact journals might be stronger, because for more popular genes, a larger number of publications on similar questions might be submitted. Furthermore, it might be more difficult to convince reviewers that a contribution on a popular gene adds sufficient novel findings to the existing body of knowledge. In contrast, research on novel or unpopular genes represents a strategy with higher risks but also higher potential outcome. If successful, research on novel genes might be perceived as important pioneer work. The optimal strategy of a researcher may depend on his/her career stage. It might be a safe strategy for a young researcher to work with established scientists on established topics. On the other hand, at some point, it is important for a researcher to be perceived as independent and to associate his/her name with a novel research topic. Furthermore, our results indicate that the stage of the research field also influences the success of a research strategy. Novel research topics seem particularly advantageous in the phase of saturation.

It is unclear whether researchers are able to determine optimal research strategies, and whether they indeed choose their research topics accordingly. Even if researchers use strategies that are optimized under the costs and benefits described above, it is questionable whether their behavior optimizes the way knowledge is established. As illustrated for the literature interaction network, differences in popularity may translate into potentially problematic biases in the research field. Furthermore, researchers may be under pressure to popularize their findings at least within the research community, which may facilitate overinterpretation of results. It therefore remains a very challenging task to make sure that the interests of individual researchers are not at odds with the interests of the research community.

Materials and Methods

In the following, we describe the procedures for estimating parameters from the data, simulating the process, and determining unexpected publication events.

The expected number of publications on gene i at time t, $\langle \Delta p_{i,t} \rangle$, is given by Eq. 2. The observed number of publications on gene i at time t is denoted by $\Delta p_{i,t}$. We assume that the number of observed publications follows a Poisson distribution given by $f(\lambda; n) = e^{-\lambda} \lambda^{n} / n!$. The likelihood $L(k_1, k_2, k_3, P_S, \alpha)$ of the data is given by the product $L(k_1, k_2, k_3, P_S, \alpha) = \prod f(\lambda = \langle \Delta p_{i,t} \rangle; n = \Delta p_{i,t})$ over all genes $i = 0 \ldots N$ and all time points $t = t_0 \ldots t_{\max}$. We assume there are 6,200 yeast genes and thereby account for ≈2,000 genes that have never appeared in a title or an abstract. We use annual data as obtained from PubMed. Because the entire publication history is required, in principle, for calculating the expected number of publications (see Eq. 2), we use the first 4 years (1975–1978) as an approximation of the initial publication history and maximize the likelihood L over the time span from $t_0 = 1979$ to $t_{\max} = 2005$. We furthermore exclude all genes that appear >20 times in the first 4 years, because these genes likely have a considerable pre-1975 history. (For yeast, this applies to five genes: PGK1, ADH1, COB, HXK1, and PFK1. All these genes code for major enzymes in yeast metabolism, a topic that has a considerably longer history than yeast genetics.) To determine the maximum-likelihood estimators of the parameters, we numerically maximize $\log L(k_1, k_2, k_3, P_S, \alpha)$ using the R function optim. To estimate confidence intervals, we generate 250 data sets by sampling (with replacement) a number of genes from the original data set and reestimate the parameters as described above.
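The likelihood described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' R code: it assumes the saturation term multiplies the whole growth rate in Eq. 2, that the counts are stored as one list of annual values per gene, and that a bootstrap sample contains as many genes as the original data set (the text does not state the sample size). All function names are hypothetical, and the numerical maximization step is omitted.

```python
import math
import random


def expected_increment(p_gene, p_mean, k1, k2, k3, p_s, alpha):
    """Assumed form of Eq. 2: linear growth scaled by the saturation term."""
    growth = k1 * p_mean + k2 * p_gene + k3
    return growth / (1.0 + (p_mean / p_s) ** alpha)


def poisson_log_pmf(n, lam):
    """log of f(lam; n) = exp(-lam) * lam**n / n!  (requires lam > 0)."""
    return -lam + n * math.log(lam) - math.lgamma(n + 1)


def log_likelihood(params, counts, n_init=4):
    """Sum of Poisson log-probabilities over all genes and fitted years.

    counts : list of per-gene lists of annual publication counts,
             all of equal length; the first n_init years only seed
             the publication history and are not scored.
    """
    k1, k2, k3, p_s, alpha = params
    totals = [sum(c[:n_init]) for c in counts]
    log_l = 0.0
    for t in range(n_init, len(counts[0])):
        p_mean = sum(totals) / len(totals)
        for i, c in enumerate(counts):
            lam = expected_increment(totals[i], p_mean, k1, k2, k3, p_s, alpha)
            log_l += poisson_log_pmf(c[t], lam)
        # Update cumulative counts only after scoring the whole year.
        for i, c in enumerate(counts):
            totals[i] += c[t]
    return log_l


def bootstrap_sample(counts, rng):
    """Resample genes with replacement (same size as the input: an assumption)."""
    return [rng.choice(counts) for _ in counts]


# Toy data: two genes, five years, scored with the reported yeast estimates.
toy_counts = [[0, 0, 0, 0, 1], [1, 0, 0, 0, 0]]
yeast_params = (0.028, 0.2, 0.005, 8.2, 1.2)
print(log_likelihood(yeast_params, toy_counts))
resampled = bootstrap_sample(toy_counts, random.Random(0))
```

In practice the negative of `log_likelihood` would be handed to a numerical optimizer, and the fit would be repeated on each resampled data set to obtain the bootstrap distribution of the parameters.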

We simulate the data set by successively calculating the expected number of publications $\langle \Delta p_{i,t} \rangle$ for each gene in each year (Eq. 2) and then generating $\Delta p_{i,t}$ by drawing a random number from a Poisson distribution with $\lambda = \langle \Delta p_{i,t} \rangle$. Again, we use the first 4 years as input and simulate the process for the time span from 1979 to 2005.
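The simulation procedure can be sketched as follows. This is a minimal illustration that assumes the saturation term multiplies the whole growth rate in Eq. 2 and starts from hypothetical initial counts rather than the real 1975–1978 data; the Poisson sampler uses Knuth's multiplication method, which is a choice made here for self-containment and is only suitable for moderate values of λ.

```python
import math
import random


def expected_increment(p_gene, p_mean, k1, k2, k3, p_s, alpha):
    """Assumed form of Eq. 2: linear growth scaled by the saturation term."""
    growth = k1 * p_mean + k2 * p_gene + k3
    return growth / (1.0 + (p_mean / p_s) ** alpha)


def poisson_draw(lam, rng):
    """Draw from Poisson(lam) with Knuth's method (moderate lam only)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def simulate(initial_totals, n_years, params, rng):
    """Return (annual increments per gene, final cumulative totals)."""
    k1, k2, k3, p_s, alpha = params
    totals = list(initial_totals)
    increments = [[] for _ in totals]
    for _ in range(n_years):
        p_mean = sum(totals) / len(totals)
        # Draw all genes for this year from the state at the start of the year.
        new = [
            poisson_draw(expected_increment(p, p_mean, k1, k2, k3, p_s, alpha), rng)
            for p in totals
        ]
        for i, n in enumerate(new):
            increments[i].append(n)
            totals[i] += n
    return increments, totals


# 200 hypothetical genes with no prior history, simulated for 27 years
# (the length of 1979-2005) with the reported yeast estimates.
yeast_params = (0.028, 0.2, 0.005, 8.2, 1.2)
increments, totals = simulate([0] * 200, 27, yeast_params, random.Random(0))
print(max(totals), sum(totals) / len(totals))
```

Because each run is a single stochastic realization, comparing simulated and observed data calls for looking at distributions (for example, the frequency distribution of totals) rather than at individual genes.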

To test whether a gene at a specific year appears at a significantly higher frequency than expected based on the model (Eq. 2), we calculate the probability P that for an expected number of publications $\langle \Delta p_{i,t} \rangle$, the number of publications is equal to or larger than the observed number of publications $\Delta p_{i,t}$. The list of all events with $P < 0.05/n_T$ is given in SI Table 1. We perform tests only for the time span from 1979 to 2005. ($n_T$ denotes the number of tests we perform and is 167,400 = 27 years × 6,200 genes. The P value given above corresponds to a conservative Bonferroni correction.) Calculations for the three-parameter model (SI Fig. 4) are done analogously, using Eq. 1 instead of Eq. 2.
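The test for unexpected publication events amounts to a Poisson upper-tail probability compared against a Bonferroni-corrected threshold. The sketch below sums the tail directly to keep precision for very small P values; the function names are hypothetical, and the expected value used in the example is an illustrative number, not a value taken from the fitted model.

```python
import math


def poisson_upper_tail(n_observed, lam):
    """P(X >= n_observed) for X ~ Poisson(lam), summed term by term."""
    if lam <= 0.0:
        return 1.0 if n_observed <= 0 else 0.0
    total = 0.0
    k = max(n_observed, 0)
    while True:
        term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
        total += term
        # Past the mode the terms shrink; stop once they no longer matter.
        if k > lam and term <= total * 1e-17:
            break
        k += 1
    return min(total, 1.0)


def is_unexpected(n_observed, lam, n_tests=27 * 6200, level=0.05):
    """Bonferroni-corrected test: P < level / n_tests."""
    return poisson_upper_tail(n_observed, lam) < level / n_tests


threshold = 0.05 / (27 * 6200)  # roughly 3e-7
# Illustrative only: seven publications in a year where 0.1 were expected.
print(threshold, poisson_upper_tail(7, 0.1), is_unexpected(7, 0.1))
```

With $n_T$ = 167,400 tests, the corrected threshold is about $3 \times 10^{-7}$, so only bursts far above the model expectation are flagged.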

Acknowledgments

We thank C. T. Bergstrom, B. Kerr, J. West, F. Taddei, and R. May for inspiring discussions and helpful comments. We gratefully acknowledge support from Society in Science/The Branco Weiss Fellowship.

Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/cgi/content/full/0701315104/DC1.

References

1. Kuhn TS. The Structure of Scientific Revolutions. Chicago: Univ of Chicago Press; 1962.
2. Popper K. Objective Knowledge: An Evolutionary Approach. Oxford: Oxford Univ Press; 1972.
3. Dawkins R. The Selfish Gene. Oxford: Oxford Univ Press; 1976.
4. Cavalli-Sforza LL, Feldman M. Cultural Transmission and Evolution. Princeton, NJ: Princeton Univ Press; 1981.
5. Boyd R, Richerson PJ. Culture and the Evolutionary Process. Chicago: Univ of Chicago Press; 1988.
6. Blackmore S. The Meme Machine. Oxford: Oxford Univ Press; 1999.
7. Blackmore S. Sci Am. 2000;283:52–61. [PubMed]
8. Boyd R, Richerson PJ. Sci Am. 2000;283:70–71. [PubMed]
9. Dugatkin LA. Sci Am. 2000;283:67–70. [PubMed]
10. Plotkin H. Sci Am. 2000;283:72–73. [PubMed]
11. de Solla Price DJ. Science. 1965;149:510–515. [PubMed]
12. Garfield E. Science. 1972;178:471–479. [PubMed]
13. Newman ME. Proc Natl Acad Sci USA. 2001;98:404–409. [PMC free article] [PubMed]
14. Borner K, Maru JT, Goldstone RL. Proc Natl Acad Sci USA. 2004;101(Suppl 1):5266–5273. [PMC free article] [PubMed]
15. Guimera R, Uzzi B, Spiro J, Amaral LA. Science. 2005;308:697–702. [PMC free article] [PubMed]
16. Redner S. Phys Today. 2005;58:49–54.
17. Chen C. Proc Natl Acad Sci USA. 2004;101(Suppl 1):5303–5310. [PMC free article] [PubMed]
18. Hoffmann R, Valencia A. Nat Genet. 2004;36:664. [PubMed]
19. Hoffmann R, Valencia A. Bioinformatics. 2005;21(Suppl 2):ii252–ii258. [PubMed]
20. Hoffmann R, Valencia A. Trends Genet. 2003;19:79–81. [PubMed]
21. Barabasi AL, Albert R. Science. 1999;286:509–512. [PubMed]
22. Redner S. Phys A. 2002;306:402–411.
23. Ioannidis JPA. PLoS Med. 2005;2:e124. [PMC free article] [PubMed]
24. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, et al. Nature. 2000;403:623–627. [PubMed]
25. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y. Proc Natl Acad Sci USA. 2001;98:4569–4574. [PMC free article] [PubMed]
26. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, et al. Nature. 2002;415:141–147. [PubMed]
27. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, et al. Nature. 2002;415:180–183. [PubMed]
28. von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P. Nature. 2002;417:399–403. [PubMed]
29. Hoffmann R, Valencia A. Trends Genet. 2003;19:681–683. [PubMed]
30. Simkin MV, Roychowdhury VP. Complex Syst. 2003;14:269–274.
