BRAKER3: Fully automated genome annotation using RNA-seq and protein evidence with GeneMark-ETP, AUGUSTUS and TSEBRA

Gene prediction has long been an active area of bioinformatics research. Still, gene prediction in large eukaryotic genomes presents a challenge that must be addressed by new algorithms. The amount and significance of the evidence available from transcriptomes and proteomes vary across genomes, between genes and even along a single gene. User-friendly and accurate annotation pipelines that can cope with such data heterogeneity are needed. The previously developed annotation pipelines BRAKER1 and BRAKER2 use RNA-seq or protein data, respectively, but not both. A further significant performance improvement was made by the recently released GeneMark-ETP, which integrates all three data types (genomic sequence, RNA-seq and protein data). We here present the BRAKER3 pipeline, which builds on GeneMark-ETP and AUGUSTUS and further improves accuracy using the TSEBRA combiner. BRAKER3 annotates protein-coding genes in eukaryotic genomes using both short-read RNA-seq and a large protein database, along with statistical models learned iteratively and specifically for the target genome. We benchmarked the new pipeline on genomes of 11 species under an assumed level of relatedness of the target species proteome to available proteomes. BRAKER3 outperformed BRAKER1 and BRAKER2. The transcript-level F1-score increased by ~20 percentage points on average, and the difference was most pronounced for species with large and complex genomes. BRAKER3 also outperformed other existing tools: MAKER2, Funannotate and FINDER. The code of BRAKER3 is available on GitHub and as a ready-to-run Docker container for execution with Docker or Singularity. Overall, BRAKER3 is an accurate, easy-to-use tool for eukaryotic genome annotation.

Supplemental Figure S1: Heatmap of F1-scores of pipelines given short-read RNA-seq libraries and a protein database (with proteins of species from the respective order excluded) as input. The last row shows the averages for the 11 different species.

Supplemental Figure S8 (plot title: BRAKER3 actual versus predicted runtime): Actual runtime versus the runtime predicted for 19 whole genomes. The regression to predict the runtime (R² = 0.87) considered only the size of the genome and whether an OrthoDB partition was used (big database used=1) or only the proteomes of a few closely related genomes were used (big database used=0).
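The regression described for Supplemental Figure S8 has two predictors: genome size and a 0/1 indicator for the big protein database. The following is a minimal sketch of such a fit, assuming an ordinary least-squares model with an intercept (the exact functional form is not stated in the caption). The function names and all numbers are hypothetical placeholders, not the 19 genomes behind the figure.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_runtime_model(genome_sizes, big_db_flags, runtimes):
    """Least-squares fit of runtime = b0 + b1 * genome_size + b2 * big_db."""
    rows = [[1.0, float(g), float(d)] for g, d in zip(genome_sizes, big_db_flags)]
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, runtimes)) for i in range(3)]
    return solve_linear_system(xtx, xty)


def predict_runtime(beta, genome_size, big_db):
    return beta[0] + beta[1] * genome_size + beta[2] * big_db


def r_squared(beta, genome_sizes, big_db_flags, runtimes):
    """Coefficient of determination of the fitted model on the given data."""
    mean = sum(runtimes) / len(runtimes)
    ss_res = sum((y - predict_runtime(beta, g, d)) ** 2
                 for g, d, y in zip(genome_sizes, big_db_flags, runtimes))
    ss_tot = sum((y - mean) ** 2 for y in runtimes)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy data (genome size in Mbp, runtime in hours), generated
# from runtime = 1 + 0.01 * size + 2 * big_db so the fit is exact.
sizes = [100, 200, 400, 800, 1200, 1500]
big_db = [0, 1, 0, 1, 0, 1]
hours = [2.0, 5.0, 5.0, 11.0, 13.0, 18.0]

beta = fit_runtime_model(sizes, big_db, hours)
r2 = r_squared(beta, sizes, big_db, hours)
```

On real measurements the fit would not be exact, and R² would fall below 1, as with the reported value of 0.87.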
Supplemental Figure S9: Proteome completeness versus the number of genes. The horizontal axis shows the total gene count, after alternative transcripts were grouped into genes. The vertical axis shows the BUSCO completeness percentage (single-copy or duplicated) for the respective gene sets. The AUGUSTUS and GeneMark-ETP gene sets were taken from the output of BRAKER3.

Supplemental Figure S10: Proteome completeness versus gene-level precision. The data are as in Supplemental Figure S9, except that the horizontal axis shows the percentage of predicted genes that identically share a transcript with the reference annotation.

Supplemental Table S11: Average sensitivity, precision, and F1-score for four different predictions generated by Funannotate using the close-relatives-included protein databases for the same species as listed in Supplemental Table S8. The prediction step of Funannotate was run with and without the option to pass gene predictions of repetitive regions to EVidenceModeler (--repeats2evm). The resulting predictions were both post-processed using Funannotate's update protocol, which updates the predicted gene models using RNA-seq data.
Table row (--repeats2evm, updated): 73.20 74.40 73.79 | 47.11 44.69 45.87 | 35.75 44.70 39.73

Here, transcriptome.fasta contained the same transcriptome assemblies that were used with BRAKER3 and constructed with HISAT2 and StringTie2. MAKER2 was then run with:

    mpiexec.mpich -n 96 maker

Preparing protein data

OrthoDB v.11 was partitioned into proteins of species from the clades Arthropoda, Metazoa, Vertebrata, and Viridiplantae. The partitioning is available from https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/. Subsequently, two protein sets for each species from Suppl. Table 1 were generated, excluding either only the target species (species-excluded) or all species of the same taxonomic order (order-excluded). These sets were prepared using the orthodb-clades pipeline, downloaded from GitHub (https://github.com/tomasbruna/orthodb-clades).

    snakemake --cores 48
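The difference between the species-excluded and order-excluded protein sets can be illustrated with a small filter. This is only a sketch of the exclusion logic, not the orthodb-clades pipeline itself. The record type, its field names, the function names and the example records are hypothetical.

```python
from typing import NamedTuple


class ProteinRecord(NamedTuple):
    # Hypothetical record layout, for illustration only.
    protein_id: str
    species: str
    order: str


def species_excluded(records, target_species):
    """Keep every protein except those of the target species."""
    return [r for r in records if r.species != target_species]


def order_excluded(records, target_order):
    """Keep every protein except those from the target's taxonomic order."""
    return [r for r in records if r.order != target_order]


# Made-up example database with a Solanales target species.
database = [
    ProteinRecord("p1", "Solanum lycopersicum", "Solanales"),
    ProteinRecord("p2", "Solanum tuberosum", "Solanales"),
    ProteinRecord("p3", "Arabidopsis thaliana", "Brassicales"),
]

sp_set = species_excluded(database, "Solanum lycopersicum")
ord_set = order_excluded(database, "Solanales")
```

The order-excluded set is always a subset of the species-excluded set, so it models the harder case in which no close relative of the target has been sequenced.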
Supplemental Table S2: Donor proteins used for each species for the close relative included protein set.
Table rows recovered from the extraction:
    1 SolTub 3.0 protein.faa.gz
    Solanum verrucosum: GCF900185275.1 falcon-dt-bn protein.faa.gz
    Solanum pennellii: GCF001406875.1 SPENNV200 protein.faa.gz

Supplemental Table S4: Sensitivity (Sn) and precision (Prec) for a protein database in which proteins of the same order as the target species were excluded. The last subtable shows the respective averages for the 11 different species. The highest number in each column is indicated in bold text. Inputs were for each species a genome assembly, short-read RNA-seq libraries, and a protein database (respective order excluded).

Supplemental Table S5: F1-scores of pipelines obtaining short-read RNA-seq libraries and a protein database (respective order excluded) as input. The subtable on the bottom right shows the averages for the 11 different species. The highest number in each column is indicated in bold text.

Supplemental Table S6: Sensitivity (Sn) and precision (Prec) for a protein database in which only proteins from the target species were excluded. The last subtable shows the respective averages for the 11 different species. The highest number in each column is indicated in bold text. Inputs were for each species a genome assembly, short-read RNA-seq libraries, and a protein database (respective species excluded).

Supplemental Table S7: F1-scores of pipelines obtaining short-read RNA-seq libraries and a protein database (respective species excluded) as input. The subtable on the bottom right shows the averages for the 11 different species. The highest number in each column is indicated in bold text.
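The tables above report sensitivity (Sn), precision (Prec) and F1-score. A minimal sketch of how these measures relate, assuming the standard definitions (Sn = TP / (TP + FN), Prec = TP / (TP + FP), and F1 as the harmonic mean of Sn and Prec). The function names are illustrative. The example values 47.11 and 44.69 are taken from the table row reported with Supplemental Table S11; reading that row as Sn, Prec, F1 triples is an assumption.

```python
def sensitivity(true_positives, false_negatives):
    """Percentage of reference features that were predicted correctly."""
    return 100.0 * true_positives / (true_positives + false_negatives)


def precision(true_positives, false_positives):
    """Percentage of predicted features that match the reference."""
    return 100.0 * true_positives / (true_positives + false_positives)


def f1_score(sn, prec):
    """Harmonic mean of sensitivity and precision (both in percent)."""
    if sn + prec == 0:
        return 0.0
    return 2.0 * sn * prec / (sn + prec)


# Values as they appear in the row reported with Supplemental Table S11.
f1 = f1_score(47.11, 44.69)
print(f"{f1:.2f}")  # prints 45.87
```

Because the harmonic mean is dominated by the smaller of the two values, a pipeline cannot reach a high F1-score by trading away either sensitivity or precision.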

Supplemental Table S9: Sensitivity, precision, and F1-score of FINDER for runs using the same input data (genomic sequence + proteins + RNA-seq) as in the experiments from Supplemental Table S4. However, FINDER exited with an error for 4 out of the 11 species tested, and we therefore do not report its performance for those species.

Supplemental Table S10: Sensitivity, precision, and F1-score of the AUGUSTUS predictions made as part of the BRAKER3 pipeline. The results correspond to Supplemental Table S4, in which proteins from the same order as the target species were excluded.

Supplemental Table S14: Runtime of Funannotate and BRAKER3 for the experiments of Supplemental Table S8 (close relatives included). The runtime is written as hours and minutes. The hardware is described in the caption of Supplemental Figure S6.