A chromosome-level assembly of the widely used Rockefeller strain of Aedes aegypti, the yellow fever mosquito

Abstract Aedes aegypti is the vector of important human diseases, and genomic resources are crucial in facilitating the study of A. aegypti and its ecosystem interactions. Several laboratory-acclimated strains of this mosquito have been established, but the most used strain in toxicology studies is “Rockefeller,” which was originally collected and established in Cuba 130 years ago. A full-length genome assembly of another reference strain, “Liverpool,” was published in 2018 and is the reference genome for the species (AaegL5). However, genetic studies with the Rockefeller strain are complicated by the availability of only the Liverpool strain as the reference genome. Differences between Liverpool and Rockefeller have been known for decades, particularly in the expression of genes relevant to mosquito behavior and vector control (e.g. olfactory). These differences indicate that AaegL5 is likely not fully representative of the Rockefeller genome, presenting potential impediments to research. Here, we present a chromosomal-level assembly and annotation of the Rockefeller genome and a comparative characterization vs the Liverpool genome. Our results set the stage for a pan-genomic approach to understanding evolution and diversity within this important disease vector.

3) Draft assembly with Canu canu/v2.1.1Canu was run on the Xanadu cluster (using SLURM grid manager) at the University of Connecticut's Computational Biology Core (UCONN CBC).A special partition was set up comprising ten nodes each with 40 cores and 192GB RAM.Canu was run with the useGrid mode, with 4 processors and 8GB of RAM provided to the executive job.The above -corMhapOptions are based on the developers' recommendations per the Canu 2.1 usage FAQ at https://canu.readthedocs.io/en/stable/faq.html#my-assembly-is-running-out-of-space-is-too-slow.
Restarts were required at the final 'unitigging' phase because the initial memory allocation by the Canu executive were insufficient.

5.2) Align trimmed reads
We made a bowtie2 index from the assembly: Then we aligned the Illumina reads: Reads were also aligned with bwa mem for coverage analysis purposes:

5.3) Run Pilon
Pilon was run on the UCHC Xanadu Cluster on a --partition=himem node, requesting 16 CPUs and 400 GiB RAM.
The output from pilon was run through purge_haplotigs (Step 4) a second time.

6) Scaffolding with RagTag
RagTag was installed in a local user directory on the UCHC Xanadu Cluster by cloning the repository from github.(https://github.com/malonge/RagTag). RagTag was run with minimap2 as the aligner, using python3 from anaconda3 (v 4.1.1).The cluster job was allocated 32 CPUs and 32 Gib of RAM.Parameters explanation: --mm2-params "-x asm5 -c" --This is passed through to minimap2 and tells it to use presets for an assembly with ~5% divergence to the reference, and to output CIGAR notation to the .pafalignment file (so that that file can be converted to SAM if desired).
-r --infer gap sizes where possible (if not set, all gaps are set to 100bp of "N").
-u --Add a suffix to the unplaced contig headers (required for NCBI conformity)

8.3) Remove Aedes aegypti proteins from the repeat library
This library, LVP_v5_RepeatModeler.consensi.fa.classified, is expected to (and does) include some models that correspond to actual Aedes aegypti genes.We don't want to mask those for alignment because they are some of our best information for aligning the genome assemblies.
So we remove them by blastx of the classified consensi models against LVP proteome.
The models that match will be listed under the qseqid header, which is the first column in the specific tab delimited blast output format.
We filter those out of the consensi using a custom python script available in my utility scripts github repository: filter_out.py The resulting RepeatsLibrary becomes the "full" repeats library.The "filtered" repeats library was created by simply removing all of the filtered repeats that were classified as "unknown": 8.4) Run RepeatMasker using the custom repeat models.
The resulting "soft masked" FASTA file (where repetitive content is lower-cased) can be converted to a "hard masked" file by use of a tiny perl script (softmask-2-hard.pl).Hard-masked FASTAs can be used (if desired) for whole genome to whole genome alignment with MUMmer (below).

9) ab initio gene model prediction with BRAKER2
9.1) Create RNA-seq hints from previously published data sickle/v1.33hisat2/v2.2.1 samtools/v1.9 These RNA-seq data are available from the NCBI SRA under the bioproject PRJNA753843.For this part of the project, we used the four libraries that were made from ROCK female mosquito abdomens (SRX11720683, SRX11720684, SRX11720687, SRX11720688).
FastQC indicated no adapter content for these reads.RNA-seq reads were trimmed using sickle in single-end mode: The trimmed reads were aligned to a HiSat2 database made from the scaffolded genome: Overall mapping rates were around 97%: All four libraries' BAM files were then merged into one using samtools merge.

9.3) Run BRAKER2
Full details on how to run BRAKER2 and the additional files needed are available at the authors' repository: Gaius-Augustus/BRAKER.

10) Orthology determination of BRAKER models with EnTAP
EnTAP/v0.9.0-beta Full documentation on the usage of EnTAP is available at the authors' website: EnTAP@ReadTheDocs  gFACs is a suite of perl scripts that parses annotation files (gtf/gff), filters gene models, sanitizes/standardizes output, and creates CDS and protein fasta files for the predicted gene set.
Full usage information is available at the authors' website: gFACs@ReadTheDocs gFACs was run on a single desktop computer with 8-core processor and 32G of RAM (gFACs is not multi-threaded).We used gFACs in three passes; first to convert the braker2 gtf to a gene_table, because braker2 gtfs are difficult (time-consuming) to parse.We broke the braker gtf up into eight files, separating GeneMark models and Augustus models for each chromosome (=6) and for the models on unplaced contigs (=2 more), in order to run parsing in parallel.Then we concatenated the resulting gene tables and ran gFACs with the scaffolded assembly provided to produce a predicted proteome multifasta for EnTAP.The final pass was to filter gene models based on EnTAP annotations.The first step was to make a repeat-masked BLASTdb of the ROCK scaffolds:

12.1) long non-coding RNA (lncRNA) annotation
For identifying putative lncRNA, we wanted to just get the single best hit for each of the annotated ncRNAs from AaegL5.3.

12.2) Gene model searching by tBLASTn
Because the BRAKER2 annotations were incomplete, we wanted to make sure we weren't truly missing a large amount of genes that should be present.tBLASTn (amino acid sequence query, translated nucleotide database) was run on the UCHC Xanadu Cluster in an array job to save wallclock time.

13) Annotation remapping with Liftoff
Liftoff/v1.6.3 installed via bioconda Liftoff takes a reference fasta and its annotations, and via minimap2 alignment, maps coordinates of annotations to a target fasta.

14.1) Alignment with nucmer
These were done with the hard-masked versions of the genomes (filtered/no unknowns, ~50% masked), except that the hardmasked "Ns" in LVP were changed to hardmasked "Xs" to prevent the aligner from considering them.Alignments were run on the UCONN CBC "Xanadu" cluster, requesting 48 cpus and 60 GiB RAM.

14.2) Filtering
The output of nucmer is a .deltafile of all alignments that meet the criteria given.For visualizing the best alignments, we first use delta-filter : These options give the 1-to-1 best alignments in the reference and the query ( -1 ) that are longer than 10,000 bp ( -l 10000 ) and better than 85% identity ( -i 85 ) ANNOTS=/mnt/d/Aedes_Liverpool_Reference/RefSeq/GCF_002204515.2_AaegL5.0_genomic.Then create a dot-plot (synteny plot) with mummerplot and gnuplot: For the high-level chromosome-only view, QFILE is the temporary names, the lengths, and the orientations of the ROCK chromosomes: and RFILE is the same for LVP:

14.4) Breakpoints analysis
show-diff is used on the unfiltered delta file to report all "breakpoints" in the alignment --places where synteny is lost.
The numbers given in Table 3 are derived from the output of show-diff after parsing and counting in R. The R code used and the session_info() are provided in on github at https://github.com/fishercera/Aedes_Rockefeller_Genome/blob/main/ParseShowDiff.md.

15) Population-level heterogeneity analysis (SNP calling)
The SNP calling pipeline uses bowtie2 alignments that are piped into freebayes to produce a variant call file.The variant call file is then filtered and characterized using bedtools and bedops.
This variant calling pipeline was adapted from a tutorial by Noah Reid, UCONN CBC staff 15.1) Align reads bowtie2/v2.3.5.1 samtools/v1.9 Reads were aligned using the same commands from step 5.2 above.

15.2) Examine coverage for outliers
BEDtools/v2.9.0 bamtools/v2.5.1 htslib/v1.9 Made a simple BED file of 1000bp windows from the assembly: The file RK_F3.canu.v4.scaffolds.genome is a two-column text file listing scaffold id and length separated by a tab: Get coverage across those 1kb windows using bamtools and bedtools: Summarize number of counts per 1kb window (over a region of interest) with htslib and a small awk script:

15.3) Make targets file that excludes outliers
Based on the output of the tabix | awk command above, and after visually examining coverage using R, decided to use only regions with counts between 20 and 350 as targets for freebayes.