Are orphan genes protein-coding, prediction artifacts, or non-coding RNAs?

BMC Bioinformatics. 2016 May 31;17(1):226. doi: 10.1186/s12859-016-1102-x.

Abstract

Background: Current genome sequencing projects reveal substantial numbers of taxonomically restricted, so-called orphan genes that lack homology with genes from other evolutionary lineages. However, it is not clear to what extent orphan genes are real protein-coding genes, genomic artifacts, or non-coding RNAs.

Results: Here, we use a simple set of assumptions to test the nature of orphan genes. First, a sequence that is transcribed is considered a real biological entity. Second, every sequence that is supported by proteome data or shows a depletion of non-synonymous substitutions is a protein-coding gene. Using genomic, transcriptomic and proteomic data for the nematode Pristionchus pacificus, we show that between 4129 and 7997 (42-81 %) of predicted orphan genes are expressed and between 3818 and 7545 (39-76 %) of orphan genes are under negative selection. In three cases that exhibited strong evolutionary constraint but lacked expression evidence in 14 RNA-seq samples, we could experimentally validate the predicted gene structures. Comparing different data sets to infer selection on orphan gene clusters, we find that the presence of a closely related genome provides the most powerful resource to robustly identify evidence of negative selection. However, even in the absence of other genomic data, the availability of paralogous sequences was enough to show negative selection in 8-10 % of orphan genes.
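
The two assumptions stated in the Results can be read as a small decision rule. The following is a minimal sketch of that rule, not the authors' pipeline: the OrphanEvidence fields, the category labels, and the plain dN/dS < 1 cutoff are all hypothetical choices made for illustration. A real analysis would typically test whether dN/dS is significantly below 1 rather than apply a bare threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OrphanEvidence:
    """Evidence collected for one predicted orphan gene (hypothetical schema)."""
    transcribed: bool        # supported by transcriptome data
    peptide_support: bool    # supported by proteome data
    dn_ds: Optional[float]   # dN/dS against a homolog; None if no comparison exists


def classify(ev: OrphanEvidence, dn_ds_cutoff: float = 1.0) -> str:
    """Apply the abstract's two assumptions to a single gene.

    The cutoff is an illustrative assumption: dN/dS below it is read as a
    depletion of non-synonymous substitutions (negative selection).
    """
    negative_selection = ev.dn_ds is not None and ev.dn_ds < dn_ds_cutoff

    # Second assumption: proteome support or negative selection
    # marks the sequence as a protein-coding gene.
    if ev.peptide_support or negative_selection:
        return "protein-coding"

    # First assumption: a transcribed sequence is a real biological entity,
    # but without further evidence it could still be a non-coding RNA.
    if ev.transcribed:
        return "expressed, coding status unresolved"

    # No evidence of any kind: possibly a prediction artifact.
    return "unsupported"
```

Under this sketch, a gene with strong constraint but no expression evidence, such as OrphanEvidence(transcribed=False, peptide_support=False, dn_ds=0.1), is still classified as protein-coding, which mirrors the three experimentally validated cases mentioned above.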

Conclusions: Our study shows that the great majority of previously identified orphan genes in P. pacificus are indeed protein-coding genes. Even though this work represents a case study on a single species, our approach can be transferred to genomic data of other non-model organisms in order to ascertain the protein-coding nature of orphan genes.

Keywords: Gene expression; Negative selection; Nematodes; Orphan genes; Ortholog; Paralog; dN/dS.

MeSH terms

  • Animals
  • Gene Expression
  • Humans
  • Multigene Family
  • Nematoda / genetics*
  • Proteomics / methods*
  • RNA, Untranslated / genetics*
  • Transcription, Genetic

Substances

  • RNA, Untranslated