Total RNA from a pool of 24 D. galeata clonal lines originating from four different lakes was sequenced on the Illumina MiSeq platform, producing a total of 40.6 million reads, corresponding to roughly 20.3 million read pairs (paired-end, 250 bp). For the de novo transcriptome assembly, multiple assemblies from four different programs (Trinity, SOAPdenovo, Oases-Velvet, Trans-ABySS) and different k-mer sizes were combined. The individual assemblers produced between 100,749 (Trinity) and 489,649 (SOAPdenovo) contigs, for a combined total of 1,218,949 (Table 1). Applying CD-HIT-EST where necessary considerably reduced the redundancy of the data set; 583,357 contigs were merged and further processed with the EviGene pipeline. EviGene, which is designed to merge different assemblies, classified 32,903 transcripts into the okay-main set and 47,849 transcripts into the alternative set. No single assembler contributed conspicuously few or many transcripts to the final set, although the contributions differed among assemblers (Table 1). Trinity, however, was better at recovering the longest proteins: 532 of the 1000 longest proteins were obtained with this assembler. The number of okay-main transcripts agrees well with the number of described genes in the related species D. pulex (30,810). In addition to the 32,903 transcripts, the tr2aacds script from the EviGene pipeline also produced the corresponding sets of CDS and proteins.
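The proportions implied by these counts can be checked with a few lines of arithmetic. The following Python sketch is only an illustration: all function and variable names are hypothetical, the figures are copied from the text above, and it assumes that the 40.6 million reads are the two mates of the 20.3 million pairs and that the 583,357 contigs are those remaining after redundancy reduction.

```python
# Sanity-check arithmetic on the counts reported in the text.
# All names are illustrative; the numbers are taken from the paragraph above.

TOTAL_READS = 40_600_000          # total MiSeq reads
COMBINED_CONTIGS = 1_218_949      # contigs summed over all assemblies
CONTIGS_TO_EVIGENE = 583_357      # contigs passed on to EviGene
OKAY_MAIN = 32_903                # EviGene okay-main transcripts
ALTERNATIVE = 47_849              # EviGene alternative transcripts
TRINITY_LONGEST = 532             # Trinity-derived among the 1000 longest proteins
D_PULEX_GENES = 30_810            # described genes in D. pulex


def summarize():
    """Return derived quantities as a dict (assumes paired reads, see lead-in)."""
    return {
        "read_pairs": TOTAL_READS // 2,
        "contigs_removed": COMBINED_CONTIGS - CONTIGS_TO_EVIGENE,
        "fraction_retained": CONTIGS_TO_EVIGENE / COMBINED_CONTIGS,
        "evigene_total": OKAY_MAIN + ALTERNATIVE,
        "trinity_share_longest": TRINITY_LONGEST / 1000,
        "ratio_to_d_pulex": OKAY_MAIN / D_PULEX_GENES,
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        # Print fractions with three decimals, counts with thousands separators.
        if isinstance(value, float):
            print(f"{key}: {value:.3f}")
        else:
            print(f"{key}: {value:,}")
```

Under these assumptions, slightly less than half of the combined contigs were passed on to EviGene, and the okay-main set exceeds the D. pulex gene count by roughly 7%.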