Natural variation and evolutionary dynamics of transposable elements in Brassica oleracea based on next-generation sequencing data

Brassica oleracea comprises various economically important vegetables and presents extremely diverse morphological variations. They provide a rich source of nutrition for human health and have been used as a model system for studying polyploidization. Transposable elements (TEs) account for nearly 40% of the B. oleracea genome and contribute greatly to genetic diversity and genome evolution. Although the proliferation of TEs has led to a large expansion of the B. oleracea genome, little is known about the population dynamics and evolutionary activity of TEs. A comprehensive mobilome profile of 45,737 TE loci was obtained from resequencing data from 121 diverse accessions across nine B. oleracea morphotypes. Approximately 70% (32,195) of the loci showed insertion polymorphisms between or within morphotypes. In particular, up to 1221 loci were differentially fixed among morphotypes. Further analysis revealed that the distribution of the population frequency of TE loci was highly variable across different TE superfamilies and families, implying a diverse expansion history during host genome evolution. These findings provide better insight into the evolutionary dynamics and genetic diversity of B. oleracea genomes and will potentially serve as a valuable resource for molecular markers and association studies between TE-based genomic variations and morphotype-specific phenotypic differentiation.

In stage one: Firstly, the 150 bp split reads of the resequenced accessions were used as queries to search against TEs ends library by Bowtie2 with the following parameters: "--local -D 20 -R 3 -N 1 -L 20 -i S,1,0.5 --score-min G,0,13 -a". Reads with 30 ~ 130 bp match to the ends of TE sequences in the library were considered as TE-junction reads spanning a TE insertion site, which consisted of TE sequences and flanking sequences, and were selected for further analyses via custom Perl scripts.
Whereas, the reads aligned to TEs sequences with full length were removed due to lacking adequate information of flanking sequence for proper mapping. Secondly, in order to determine the genomic location of the insertions, the reads retained in the previous step were trimmed off TE portion and then mapped to the line 02-12 reference genome by Bowtie2 with the following parameters: "-a", then the reads uniquely mapped to reference genome were kept. Thirdly, a non-redundant list of TE loci were obtained by clustering the TE insertion sites according to the information of their genomic location. As a result, we identified a total of 62,051 non-redundant TE loci located in unique genomic sequences from B. oleracea population composed of 121 diverse varieties.
For a particular locus, the above analysis merely allowed us to acquire a list of accessions in which TE present. However, the remaining accessions still kept uncertain whether the observed absence of a TE insertion is true or is attributable to a lack of sequencing reads. In addition, the sequence divergence between TE at a particular site and the representative one maybe also miss a small number of accessions carrying TE insertions, which will discount the accuracy of the data. Therefore, it is still needed to further analysis with the following procedures to address precisely these kinds of issues.
In stage two, the 150 bp sequences immediately flanking the mapped TE loci were first extracted from reference genome and grouped into a database. Secondly, the NGS reads of all resequenced accessions were used as queries to search the flanking sequences database extracted in the last step one by one using bowtie2 with the following parameters: "--local -D 20 -R 3 -N 1 -L 20 -i S,1,0.5 --score-min G,0,13 -a".
Reads with full length alignment to flanking sequences were removed, while chimeric reads for which only part belong to flanking sequences were retained. Thirdly, these chimeric reads were trimmed off matching portion, and the remaining sequence of the reads were aligned to the end sequences of corresponding TEs by Blast+, to decide whether these locations probably possess the expected TE insertions. In this way, we could characterize presence or absence of TE insertions at all identified loci in individual genomes and the initial mobilome profile of B. oleracea was obtained.

The calculation of heterozygosity threshold
Since the ratio of the observed heterozygosity (Hobs, the proportion of heterozygous calls in all non-missing calls of a locus) to the expected heterozygosity (Hexp = 2pq, where p and q are the two allele frequencies of a locus) was expected to be distributed around 1-F, where F represented the inbreeding coefficient, we calculated the inbreeding coefficient (F) as the median value of 1-Hobs/Hexp at TE loci where Hobs/Hexp < 1. Then a threshold of max observed heterozygotes (Hobs_max) was obtained by the following equation: Hobs_max = 10 (1-F) Hexp, where a TE locus was considered as including excessive heterozygotes, if Hobs_max was ten times more than the most likely value for given frequency and inbreeding coefficient.
Supplementary Figure 1 The general schematic view of the procedure to identify TE insertion polymorphism. Supplementary Figure 2 The outline of the procedure for the identification of morphotypes-specific insertions.

TE-junction Reads
Supplementary Figure 3 The correlation between the number of morphotype-differential fixed loci and the level of population differentiation. The population pairwise estimates of Fst were calculated from the 13,181 polymorphic TE loci in the dataset using R package SNPRelate.
The level of population differentiation ( The effective population size (Ne) histories of the four B. oleracea morphotypes inferred from SNP variations using SMC++ (Terhorst et al., 2017). The average mutation rate (r) was set to be 1.5×10 -8 substitutions per site per year (Koch et al., 2000), and a generation time of 1 year was used to convert coalescent scaling to calendar time.