NCBI Apis dorsata Annotation Release 100

The RefSeq genome records for Apis dorsata were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Apis dorsata Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Dec 20 2013
Date of submission of annotation to the public databases: Jan 8 2014
Software version: 5.2

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
Apis dorsata 1.3	GCF_000469605.1	Cold Spring Harbor Laboratory	09-24-2013	Reference	unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	Apis dorsata 1.3
Genes and pseudogenes	11,507
protein-coding	9,910
non-coding	1,468
pseudogenes	129
mRNAs	18,144
fully-supported	17,110
with > 5% ab initio	561
partial	318
known RefSeq (NM_)	0
model RefSeq (XM_)	18,144
model RefSeq (XM_) with correction	47
Other RNAs	2,557
fully-supported	2,414
with > 5% ab initio	0
partial	0
known RefSeq (NR_)	0
model RefSeq (XR_)	2,414
CDSs	18,144
fully-supported	17,110
with > 5% ab initio	627
partial	318
known RefSeq (NP_)	0
model RefSeq (XP_)	18,144
model RefSeq (XP_) with correction	47
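
The top-level gene count in the table above can be cross-checked against its three subcategories. The following is a minimal sketch with the values copied by hand from the Feature counts table; the dictionary keys and the function name `check_gene_total` are illustrative and not part of any NCBI tool.

```python
# Values transcribed by hand from the "Feature counts" table (Apis dorsata 1.3).
# The key names are illustrative, not an NCBI schema.
FEATURE_COUNTS = {
    "genes_and_pseudogenes": 11_507,
    "protein_coding": 9_910,
    "non_coding": 1_468,
    "pseudogenes": 129,
}


def check_gene_total(counts):
    """Return True if the three subcategories add up to the reported total."""
    parts = (
        counts["protein_coding"]
        + counts["non_coding"]
        + counts["pseudogenes"]
    )
    return parts == counts["genes_and_pseudogenes"]


def genes_without_pseudogenes(counts):
    """Protein-coding plus non-coding genes, excluding pseudogenes."""
    return counts["protein_coding"] + counts["non_coding"]


if __name__ == "__main__":
    print(check_gene_total(FEATURE_COUNTS))
    print(genes_without_pseudogenes(FEATURE_COUNTS))
```

With these inputs the subcategories sum to the reported total, and the count excluding pseudogenes equals the 11,378 shown in the Genes row of the Feature lengths table.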

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	11,378	13,107	3,655	63	834,042
All transcripts	20,701	2,270	1,787	61	57,824
mRNA	18,144	2,436	1,929	67	57,824
misc_RNA	654	2,441	1,925	123	12,484
tRNA	143	74	72	61	84
lncRNA	1,760	672	543	84	5,614
Single-exon transcripts	255	1,367	1,063	67	7,092
coding transcripts (NM_/XM_)	255	1,367	1,063	67	7,092
CDSs	18,144	1,892	1,395	67	57,621
Exons	87,149	291	200	1	11,843
in coding transcripts (NM_/XM_)	81,824	294	201	1	11,843
in non-coding transcripts (NR_/XR_)	8,048	244	172	1	4,210
Introns	73,216	2,238	157	30	523,322
in coding transcripts (NM_/XM_)	69,779	2,108	148	30	523,322
in non-coding transcripts (NR_/XR_)	6,079	3,733	381	30	452,868
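
The tables in this report group products by RefSeq accession prefix: NM_, NR_ and NP_ are labeled "known", while XM_, XR_ and XP_ are labeled "model". As a small illustration, the sketch below maps a prefix to those labels. The function name `classify_accession` and the example accession are hypothetical placeholders, and only the six prefixes that appear in this report are covered.

```python
# Prefix labels as they appear in this report's tables.
# Only the six prefixes used in the report are listed here.
PREFIX_LABELS = {
    "NM_": ("known", "mRNA"),
    "XM_": ("model", "mRNA"),
    "NR_": ("known", "other RNA"),
    "XR_": ("model", "other RNA"),
    "NP_": ("known", "protein (CDS)"),
    "XP_": ("model", "protein (CDS)"),
}


def classify_accession(accession):
    """Return (status, molecule) for a RefSeq-style accession,
    or None if the prefix is not one of the six listed above."""
    prefix = accession.strip()[:3].upper()
    return PREFIX_LABELS.get(prefix)


if __name__ == "__main__":
    # "XM_000000000.1" is a made-up placeholder, not a real record.
    print(classify_accession("XM_000000000.1"))
```

For the placeholder above the function returns the "model" mRNA label, which matches this release, where all 18,144 mRNAs are model RefSeq (XM_) records.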

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.81	1	1	42
Number of exons per transcript	8.14	6	1	183

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
Apis dorsata 1.3	GCF_000469605.1	6.41%	42.48%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with short reads and reported in the Short read transcript alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species Genbank	37	37 (100.00%)	20 (54.05%)	98.65%	99.70%
Apis mellifera known RefSeq (NM_/NR_)	673	617 (91.68%)	321 (47.70%)	95.00%	96.71%
Apis mellifera Genbank	600	519 (86.50%)	293 (48.83%)	95.17%	96.55%
Apis mellifera EST	169,408	116,515 (68.78%)	99,916 (58.98%)	94.81%	97.11%
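
The percentages in the table above can be reproduced from the raw counts. Judging from the numbers, both the aligned and the passed-to-Gnomon percentages appear to be computed relative to the number of sequences retrieved from Entrez, not relative to the number aligned. The sketch below assumes that reading; the function names are illustrative.

```python
# Rows transcribed from the "Transcript alignments" table:
# (source, retrieved from Entrez, aligned by Splign, passed to Gnomon)
TRANSCRIPT_ROWS = [
    ("Same-species Genbank", 37, 37, 20),
    ("Apis mellifera known RefSeq (NM_/NR_)", 673, 617, 321),
    ("Apis mellifera Genbank", 600, 519, 293),
    ("Apis mellifera EST", 169_408, 116_515, 99_916),
]


def percent_of(part, total):
    """Format part/total as a percentage string with two decimals."""
    return f"{100.0 * part / total:.2f}"


def summarize(rows):
    """Return (source, % aligned, % passed to Gnomon) per row.

    Assumption: both percentages use the retrieved count as denominator.
    """
    return [
        (source, percent_of(aligned, retrieved), percent_of(passed, retrieved))
        for source, retrieved, aligned, passed in rows
    ]


if __name__ == "__main__":
    for source, pct_aligned, pct_passed in summarize(TRANSCRIPT_ROWS):
        print(f"{source}: aligned {pct_aligned}%, passed to Gnomon {pct_passed}%")
```

Under this assumption the computed values agree with every percentage printed in the table, for example 617 of 673 for the aligned Apis mellifera known RefSeq sequences and 321 of 673 for those passed to Gnomon.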

Short read transcript alignments

The following short reads (RNA-Seq) from the Sequence Read Archive (SRA) were also used for gene prediction:

Sample Id	Track name	Number of reads	Number (%) of aligned reads	Number (%) spliced reads	Number of introns
All	multiple samples	6,030,884,356	3,662,768,983 (60.73%)	743,874,372 (12.33%)	152,398
SRS010495	queen ovary (Apis mellifera, SRS010495)	1,357,405	1,033,008 (76.10%)	641,682 (47.27%)	57,242
SRS114874	nurse brain #1 (Apis mellifera, SRS114874)	9,912,179	6,949,444 (70.11%)	1,091,459 (11.01%)	61,859
SRS121285	nurse brain #2 (Apis mellifera, SRS121285)	30,680,547	21,914,697 (71.43%)	3,340,338 (10.89%)	79,784
SRS121286	nurse brain #3 (Apis mellifera, SRS121286)	16,911,515	12,074,250 (71.40%)	1,842,527 (10.90%)	74,993
SRS121287	nurse brain #4 (Apis mellifera, SRS121287)	14,514,311	10,273,986 (70.79%)	1,655,926 (11.41%)	73,890
SRS121288	nurse brain #5 (Apis mellifera, SRS121288)	11,451,917	7,471,902 (65.25%)	1,239,198 (10.82%)	65,878
SRS121289	forager brain #1 (Apis mellifera, SRS121289)	15,738,118	10,876,404 (69.11%)	1,726,485 (10.97%)	69,784
SRS121290	forager brain #2 (Apis mellifera, SRS121290)	14,679,014	10,391,045 (70.79%)	1,704,323 (11.61%)	73,163
SRS121291	forager brain #3 (Apis mellifera, SRS121291)	13,917,711	9,585,802 (68.87%)	1,560,256 (11.21%)	67,774
SRS121292	forager brain #4 (Apis mellifera, SRS121292)	25,440,947	17,614,076 (69.24%)	2,856,092 (11.23%)	75,494
SRS121293	forager brain #5 (Apis mellifera, SRS121293)	28,566,976	19,765,682 (69.19%)	3,103,139 (10.86%)	77,575
SRS300641	reverted nurse brains pool #1 (Apis mellifera, SRS300641)	427,668,328	207,632,690 (48.55%)	41,287,356 (9.65%)	113,578
SRS300642	reverted nurse brains pool #2 (Apis mellifera, SRS300642)	590,830,686	359,784,542 (60.89%)	70,608,286 (11.95%)	125,060
SRS300643	reverted nurse brains pool #3 (Apis mellifera, SRS300643)	303,929,248	168,871,506 (55.56%)	32,888,007 (10.82%)	105,524
SRS300644	reverted nurse brains pool #4 (Apis mellifera, SRS300644)	308,963,852	157,795,950 (51.07%)	31,271,684 (10.12%)	105,883
SRS300645	reverted nurse brains pool #5 (Apis mellifera, SRS300645)	277,277,252	160,532,523 (57.90%)	31,389,098 (11.32%)	106,171
SRS300646	reverted nurse brains pool #6 (Apis mellifera, SRS300646)	451,987,942	265,440,313 (58.73%)	52,244,267 (11.56%)	111,740
SRS300647	forager brains pool #1 (Apis mellifera, SRS300647)	357,892,462	172,343,949 (48.16%)	34,184,488 (9.55%)	112,203
SRS300648	forager brains pool #2 (Apis mellifera, SRS300648)	386,101,840	228,095,154 (59.08%)	43,565,769 (11.28%)	116,849
SRS300649	forager brains pool #3 (Apis mellifera, SRS300649)	242,309,834	126,936,005 (52.39%)	22,775,181 (9.40%)	106,454
SRS300650	forager brains pool #4 (Apis mellifera, SRS300650)	329,168,752	174,406,384 (52.98%)	33,386,139 (10.14%)	110,949
SRS300651	forager brains pool #5 (Apis mellifera, SRS300651)	359,328,208	205,489,506 (57.19%)	38,113,898 (10.61%)	113,047
SRS300652	forager brains pool #6 (Apis mellifera, SRS300652)	278,812,238	146,231,179 (52.45%)	27,891,434 (10.00%)	109,283
SRS308273	forager brain (Apis mellifera, SRS308273)	116,791,866	52,599,276 (45.04%)	7,975,415 (6.83%)	83,453
SRS334276	generic sample (Apis mellifera, SRS334276)	380,046,570	265,626,402 (69.89%)	58,194,977 (15.31%)	106,521
SRS403330	Brain (Apis mellifera carnica, female, SRS403330)	175,796,462	142,588,520 (81.11%)	32,914,235 (18.72%)	107,912
SRS403331	Brain (Apis mellifera carnica, female, SRS403331)	168,267,063	138,427,208 (82.27%)	32,563,291 (19.35%)	107,374
SRS403332	Brain (Apis mellifera carnica, female, SRS403332)	178,934,269	145,959,552 (81.57%)	34,176,758 (19.10%)	107,924
SRS403333	Brain (Apis mellifera carnica, male, SRS403333)	189,487,841	153,691,847 (81.11%)	35,806,665 (18.90%)	107,677
SRS403334	Brain (Apis mellifera carnica, male, SRS403334)	162,754,857	132,703,125 (81.54%)	30,285,593 (18.61%)	106,698
SRS403335	Brain (Apis mellifera carnica, male, SRS403335)	153,692,102	124,647,540 (81.10%)	29,040,230 (18.90%)	104,807
SRS418897	Whole body (Apis mellifera, SRS418897)	814,750	364,357 (44.72%)	103,122 (12.66%)	23,628
SRS418898	Abdomen (Apis mellifera, SRS418898)	622,279	282,163 (45.34%)	114,093 (18.33%)	23,635
SRS419029	Antennae (Apis mellifera, SRS419029)	1,135,230	793,681 (69.91%)	443,761 (39.09%)	32,867
SRS419030	embryo (Apis mellifera, SRS419030)	1,136,885	799,754 (70.35%)	440,633 (38.76%)	38,894
SRS419031	Brain and ovary (Apis mellifera, SRS419031)	1,556,239	952,100 (61.18%)	355,414 (22.84%)	53,728
SRS419032	Testes (Apis mellifera, SRS419032)	1,317,973	972,234 (73.77%)	577,491 (43.82%)	28,944
SRS419033	Larvae (Apis mellifera, SRS419033)	1,088,688	851,227 (78.19%)	515,662 (47.37%)	27,576
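
Cells in the table above combine a comma-grouped count with a percentage, as in "1,033,008 (76.10%)". For readers who want to process these rows, the following is a minimal parsing sketch. It assumes the six columns are tab-separated, as they are in this text rendering; the names `parse_count_cell` and `parse_short_read_row` are illustrative.

```python
import re

# Matches cells such as "1,033,008 (76.10%)".
COUNT_PCT = re.compile(r"^\s*([\d,]+)\s*\(([\d.]+)%\)\s*$")


def parse_int(text):
    """Convert a comma-grouped count such as '1,357,405' to an int."""
    return int(text.replace(",", ""))


def parse_count_cell(cell):
    """Split a 'count (pct%)' cell into (int count, float percent)."""
    match = COUNT_PCT.match(cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return parse_int(match.group(1)), float(match.group(2))


def parse_short_read_row(line):
    """Parse one row of the table, assuming six tab-separated columns."""
    sample_id, track, reads, aligned, spliced, introns = line.rstrip("\n").split("\t")
    aligned_n, aligned_pct = parse_count_cell(aligned)
    spliced_n, spliced_pct = parse_count_cell(spliced)
    return {
        "sample_id": sample_id,
        "track": track,
        "reads": parse_int(reads),
        "aligned": aligned_n,
        "aligned_pct": aligned_pct,
        "spliced": spliced_n,
        "spliced_pct": spliced_pct,
        "introns": parse_int(introns),
    }


if __name__ == "__main__":
    row = (
        "SRS010495\tqueen ovary (Apis mellifera, SRS010495)\t1,357,405\t"
        "1,033,008 (76.10%)\t641,682 (47.27%)\t57,242"
    )
    print(parse_short_read_row(row))
```

Once parsed, the printed percentages can be compared against the ratio of aligned or spliced reads to total reads for each sample.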

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Insecta GenBank	74,336	52,502 (70.63%)	52,502 (70.63%)	66.41%	62.61%
Insecta RefSeq	5,986	4,791 (80.04%)	4,791 (80.04%)	67.41%	63.47%
Drosophila melanogaster known RefSeq (NP_)	27,764	18,256 (65.75%)	18,256 (65.75%)	63.52%	52.97%
Homo sapiens known RefSeq (NP_)	36,418	19,943 (54.76%)	19,943 (54.76%)	57.62%	38.66%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
