Evolution of a Cytoplasmic Determinant: Evidence for the Biochemical Basis of Functional Evolution of the Novel Germ Line Regulator Oskar

Abstracts Germ line specification is essential in sexually reproducing organisms. Despite their critical role, the evolutionary history of the genes that specify animal germ cells is heterogeneous and dynamic. In many insects, the gene oskar is required for the specification of the germ line. However, the germ line role of oskar is thought to be a derived role resulting from co-option from an ancestral somatic role. To address how evolutionary changes in protein sequence could have led to changes in the function of Oskar protein that enabled it to regulate germ line specification, we searched for oskar orthologs in 1,565 publicly available insect genomic and transcriptomic data sets. The earliest-diverging lineage in which we identified an oskar ortholog was the order Zygentoma (silverfish and firebrats), suggesting that oskar originated before the origin of winged insects. We noted some order-specific trends in oskar sequence evolution, including whole gene duplications, clade-specific losses, and rapid divergence. An alignment of all known 379 Oskar sequences revealed new highly conserved residues as candidates that promote dimerization of the LOTUS domain. Moreover, we identified regions of the OSK domain with conserved predicted RNA binding potential. Furthermore, we show that despite a low overall amino acid conservation, the LOTUS domain shows higher conservation of predicted secondary structure than the OSK domain. Finally, we suggest new key amino acids in the LOTUS domain that may be involved in the previously reported Oskar−Vasa physical interaction that is required for its germ line role.


Introduction
With the evolution of obligate multicellularity, many organisms faced a challenge considered a major evolutionary transition: allocating only some cells (germ line) to pass on their genetic material to the next generation, relegating the remainder (soma) to death upon death of the organism (reviewed in Kirk [2005]). Although there are multiple mechanisms of germ cell specification, they can be grouped into two broad categories, induction or inheritance (reviewed in Extavour and Akam [2003]). Under induction, cells respond to an external signal by adopting germ cell fate. Under the inheritance mechanism, maternally synthesized cytoplasmic molecules, located within a specialized cytoplasm called germ plasm, are deposited in the oocyte and "inherited" by a subset of cells during early embryonic divisions. Cells inheriting these molecules commit to a germ line fate (reviewed in Extavour and Akam [2003]).
The inheritance mechanism in insects that undergo metamorphosis (Holometabola) appears to have evolved by cooption of a key gene, oskar. oskar was first identified in forward genetic screens for axial patterning mutants in Drosophila melanogaster (Lehmann and Nüsslein-Volhard 1986). For the first 20 years following its discovery, oskar appeared to be restricted to Drosophilids (Clark et al. 2007). Its later discovery in the mosquitoes Aedes aegypti, Anopheles gambiae, and Culex quinquefasciatus (Juhn and James 2006;Juhn et al. 2008) and the wasp Nasonia vitripennis (Lynch et al. 2011) suggested the hypothesis that oskar emerged at the base of the Holometabola, and facilitated the evolution of germ plasm in these insects (Lynch et al. 2011). However, our subsequent identification of oskar homologs in the cricket Gryllus bimaculatus (Ewen-Campen et al. 2012), and in many additional hemimetabolous insect species (Blondel et al. 2020), demonstrated that oskar predates the Holometabola, and must be at least as old as the major radiation of insects (Misof et al. 2014). Two secondary losses of oskar from insect genomes have also been reported, in the beetle Tribolium castaneum (Lynch et al. 2011) and the honeybee Apis mellifera , and neither of these insects appear to use germ plasm to establish their germ lines (Nelson 1915;Nagy et al. 1994;Dearden 2006;Schroder 2006). Whether oskar is ubiquitous across all insect orders, whether it is truly unique to insects, the evidence for or against potential losses or duplications of the oskar locus across insects, and the evolutionary dynamics of the locus, remain unknown.oskar remains, to our knowledge, the only gene that has been experimentally demonstrated to be both necessary and sufficient to induce the formation of functional primordial germ cells (called pole cells in Drosophila) (Kim-Ha et al. 1991;Ephrussi and Lehmann 1992). Thus, in D. melanogaster (Lehmann and Nüsslein-Volhard 1986;Kim-Ha et al. 1991;Ephrussi and Lehmann 1992) and potentially more broadly in holometabolous insects with germ plasm (Lynch et al. 2011;Rafiqi et al. 2020), oskar plays an essential germ line role. However, it is clear that oskar's germ line function can evolve rapidly, as even within the genus Drosophila, oskar homologs from different species cannot always substitute for each other (Webster et al. 1994;Jones and Macdonald 2007). Moreover, the ancestral function of this gene may have been in the nervous system rather than the germ line (Ewen-Campen et al. 2012). The current hypothesis is therefore that it was co-opted to play a key role in the acquisition of an inheritance-based germ line specification mechanism $300 Mya (Misof et al. 2014), in the lineage leading to the Holometabola (Ewen-Campen et al. 2012). Thus, the case of oskar offers an opportunity to study the evolution of protein function at multiple levels of biological organization, from the genesis of a novel protein, through to potential co-option events and the evolution of functional variation.
Neofunctionalization often correlates with a change in the fitness landscape of the protein sequence caused by novel biochemical constraints imposed by amino acid sequence changes (Sikosek et al. 2012;Sikosek and Chan 2014). Such potential constraints may be revealed by analyzing the conservation of amino acids, their chemical properties, or structure at the secondary, tertiary or quaternary levels (Sikosek and Chan 2014). Oskar has two well-structured domains conserved across identified homologs to date (Blondel et al. 2020): an N-terminal Helix Turn Helix domain termed LOTUS with potential RNA-binding properties (Anantharaman et al. 2010;Jeske et al. 2015;Yang et al. 2015;Jeske et al. 2017), and a C-terminal GDSL-lipase-like domain called OSK (Jeske et al. 2015;Yang et al. 2015) (fig. 1). These two domains are linked by an unstructured highly variable interdomain sequence (Ahuja and Extavour 2014;Jeske et al. 2015;Yang et al. 2015). We previously showed that this domain structure is likely the result of a horizontal transfer event of a bacterial GDSL-lipase-like domain, followed by the fusion of this domain with a LOTUS domain in the host genome (Blondel et al. 2020). Biochemical assays of the properties of the LOTUS and OSK domains provide some clues as to the molecular mechanisms that Oskar uses to assemble germ plasm in D. melanogaster. The LOTUS domain is capable of homodimerization (Jeske et al. 2015(Jeske et al. , 2017, and directly binds and enhances the helicase activity of the ATP-dependent DEAD box helicase Vasa, a germ plasm component (Jeske et al. 2017). The OSK domain resembles GDSL lipases in sequence (Jeske et al. 2015;Yang et al. 2015;Blondel et al. 2020), but is predicted to lack enzymatic activity, as the conserved amino acid triad (S200 D202 H205) that defines the active site of these lipases is not conserved in OSK (Anantharaman et al. 2010;Jeske et al. 2015;Yang et al. 2015). Instead, copurification experiments suggest that OSK has RNA-binding properties, consistent with its predicted basic surface residues (Jeske et al. 2015;Yang et al. 2015). Whether or how changes in the primary sequence of Oskar can explain the evolution of its molecular mechanism or tissue-specific function, remain unknown.
To date, sequences of $100 oskar homologs have been reported (Lynch et al. 2011;Jeske et al. 2015;Quan and Lynch 2016;Blondel et al. 2020). However, the vast majority of these are from the Holometabola, and it is thus unclear whether analysis of these sequences alone would have sufficient power to allow extrapolation of conservation and divergence of putative biochemical properties across insects broadly speaking. Multiple hypotheses as to the molecular mechanistic function of particular amino acids in the LOTUS and OSK domains in D. melanogaster have been proposed (Jeske et al. 2015;Yang et al. 2015;Jeske et al. 2017), but without sufficient taxon sampling, the potential relevance of these mechanisms to oskar's evolution and function in other insects is unclear.
Here we address these outstanding questions by applying a rigorous bioinformatic pipeline to generate the most complete collection of oskar sequences to date. By analyzing 1,862 Pancrustacean genomes and transcriptomes, we show that oskar likely first arose at least 400 Ma, before the advent of winged insects (Pterygota). We find that the oskar locus has been lost independently in some insect orders, including near-total absence from the order Hemiptera, and clarify that the absence of oskar from the Bombyx mori and T. castaneum genomes (discussed in Quan and Lynch 2016) does not reflect a general absence of oskar from Lepidoptera or Coleoptera. By comparing Oskar sequences in a phylogenetic context, we reveal that distinct biophysical properties of Oskar are associated with Hemimetabola and Holometabola. We use these observations to propose testable hypotheses regarding the putative biochemical basis of evolutionary change in Oskar function across insects.

HMM-Based Discovery Pipeline Yields Hundreds of Novel oskar Homologs
We wished to study the evolution of the oskar gene sequence as comprehensively as possible across all insects. To expand our previous collection of nearly 100 homologous sequences (Blondel et al. 2020), we designed a new bioinformatics pipeline to scan and search for oskar homologs across all 1,565 Blondel et al. . doi:10.1093/molbev/msab284 MBE NCBI insect transcriptomes and genomes that were publicly available at the time of analysis (supplementary table S1, Supplementary Material online; fig. 2; see Genome and Transcriptome Preprocessing in Materials and Methods for NCBI accession numbers and additional information). First, we used the HMMER tool suite to build HMM models for each of the LOTUS and OSK domains, using our previously generated multiple sequence alignments (MSA) (Blondel et al. 2020). We subjected genomes to in silico gene model inference using Augustus (Stanke et al. 2006). We translated the resulting predicted transcripts, as well as the predicted transcripts from RNA-seq data sets, in all six frames. We then scanned the resulting protein sequences for the presence of LOTUS and OSK domains using the aforementioned HMM models. Sequences were designated as oskar homologs based on the same criteria as in our previous study (Blondel et al. 2020), namely, sequences containing both a LOTUS and an OSK domain (Jeske et al. 2015), separated by a variable interdomain region. We then aligned all sequences using hmmalign and the HMM derived from our previously published full length Oskar alignment (Blondel et al. 2020). The first iteration of the alignment was manually curated as previously described (Blondel et al. 2020), and sequence duplicates and sequences that did not align correctly were removed. All subsequent iterations were automatically curated following the process described in Materials and Methods: Identification of Oskar Homologs.
With these methods, we recovered a total of 379 unique oskar sequences from 350 unique species. To our knowledge, this comprises the largest collection of oskar homologs described to date. To determine if oskar homologs might predate Insecta, we applied the discovery pipeline to all 31 genomes and 266 transcriptomes of noninsect pancrustaceans available at the time of analysis (see Genomes and Transcriptomes Preprocessing in Materials and Methods for complete list). However, we did not recover any noninsect sequences meeting our criteria for oskar homologs ( fig. 3), strongly suggesting that oskar is restricted to the insect lineage (Lynch et al. 2011;Ahuja and Extavour 2014).
We found that 58.65% of RefSeq genomes (78/133), 30.42% of GenBank genomes (94/309), and 21.19% of transcriptomes (238/1,123) analyzed contained predicted oskar homologs (supplementary table S1 and fig. S1a, Supplementary Material online). Given that detection of putative homologs is highly dependent on the quality of the genome assembly and annotation, we asked whether there were differences in the assembly statistics of genomes with and without predicted oskar homologs. We observed a significant difference in N50, L50, number of contigs, and number of scaffolds between genomes lacking oskar hits and those where oskar was identified (MannÀWhitney U test P value < 0.05). Genomes where we did not find oskar showed a significantly higher mean/median contig and scaffold count, smaller contig and scaffold N50 length, larger contig and scaffold L50, and more contigs or scaffolds per genome length, than genomes where we detected an oskar homolog (MannÀWhitney U test P < 0.05; supplementary fig. S2 and table S2, Supplementary Material online). We interpret this to mean that oskar may appear to be absent from these data sets due to potentially incomplete sequencing, suggesting that deeper sequencing in these lineages could possibly reveal additional new oskar homologs in future studies. However, we note that we believe that our analysis provides strong evidence for true oskar loss in at least some lineages, given their very deeply sequenced and well-annotated genomes (e.g., A. mellifera, T. castaneum).
oskar Predates the Divergence of Ametabola and Other Insects We identified oskar homologs in 15 of the 29 generally recognized insect orders (Misof et al. 2014 MBE 2020). The novel finding of an oskar homolog from the silverfish Atelura formicaria (Zygentoma) allows us to date back the origin of oskar further than previous analyses, to at least 420 Ma (Misof et al. 2014), before the divergence of Ametabola from the remaining insect lineages. We then explored the distribution of oskar sequences across insect phylogeny. Interestingly, we identified multiple lineages where oskar appeared to have been lost independently, including confirming the previously reported (Lynch et al. 2011) losses from the genomes of the red flour beetle T. castaneum, the honeybee A. mellifera, and the silk moth B. mori ( fig. 3). Notably, within Lepidoptera we identified oskar homologs in only 3 species, despite the fact that we searched 232 available lepidopteran sequence data sets, including 17 well-annotated RefSeq genomes and 135 transcriptomes ( fig. 3;  FIG. 2. Schematic presentation of the oskar homolog detection pipeline. Sequences were collected automatically from the three NCBI databases, GenBank (GCA), RefSeq (GCF), and TSA database. RefSeq genomes were used to generate Augustus gene model HMMs, which were used to annotate and predict proteins in the nonannotated genomes obtained from GenBank. Transcripts from the TSA database were six-frame translated using TRANSEQ. Amino acid sequences were consolidated into three protein databases. hmmsearch from the HMMER tool suite was used to search for LOTUS and OSK hits in those sequences. Sequences with hits for both the LOTUS and OSK domains with an E-value < 0.05 were annotated as oskar sequences. Sequences were then cleaned to remove duplicates (sequences with <80% sequence similarity coming from the same organism). The resulting sequences were aligned using hmmalign, and the process was repeated until no new sequences were identified. Finally, the sequences were consolidated with the data set metadata into the oskar homolog database that was used for all subsequent analyses. Blondel et al. . doi:10.1093/molbev/msab284 MBE online). In principle, this apparent widespread absence of oskar in Lepidoptera could be due to unusually rapid evolution of the oskar sequence in this lineage, which might render lepidopteran oskar homologs undetectable by our methods.
However, we note that the only four lepidopteran homologs we detected all belonged to species of the basally branching Adelidae and Palaephatidae families. We therefore favor the interpretation that oskar was lost from a last common 3. Summary of oskar distribution and expression in insects. Phylogeny from Misof et al. (2014). Symbols in order from left to right: (i) vertical rectangles: gray: no oskar homolog was identified in this order. Color (unique for each order): at least one oskar homolog was identified in this order. (ii) Number of data sets searched. (iii) Horizontal rectangles: proportion of searched data sets in which an oskar homolog was identified. (iv) Pie chart: proportion of oskar sequences identified in RefSeq (GCF) data sets. (v) Pie chart: proportion of oskar sequences identified in GenBank (GCA) data sets. (vi) Pie chart: proportion of oskar sequences identified in TSA database data sets; (vii) oskar sequences identified in tissue related to germ line (transcriptomes derived from reproductive organs, eggs, or embryos); (viii) oskar sequences identified in tissue related to the brain (transcriptomes derived from brain or head); (ix) oskar sequences identified in an egg stage transcriptome; (x) oskar sequences identified in a larval stage transcriptome; (xi) oskar sequences identified in a pupal stage transcriptome; (xii) oskar sequences identified in a nymphal or juvenile stage transcriptome; (xiii) oskar sequences identified in an adult transcriptome. All numbers represented graphically here are in supplementary table S1, Supplementary Material online. No data sets were available for Protura, Diplura, or Isoptera at the time of analysis.  (Mitter et al. 2017;Kawahara et al. 2019).
The Hemiptera also appear to have lost oskar, based on our analysis of the 222 data sets available for this clade, including 12 RefSeq genomes and 192 transcriptomes. However, we did identify an oskar homolog in the Thysanoptera, which is a hemipteran sister group (Misof et al. 2014). Finally, we identified oskar homologs in only four of the 11 orders of the Polyneoptera for which data were available. With the exception of Mantodea (13 transcriptomes), the four orders with detectable oskar sequences all had more than ten available sequence data sets (Plecoptera: three genomes and eight transcriptomes; Orthoptera: three genomes and 28 transcriptomes; Phasmatodea: 13 genomes and 31 transcriptomes; Blattodea: five genomes and 51 transcriptomes). The remaining orders had fewer than eight data sets each available for analysis ( fig. 3; supplementary table S1, Supplementary Material online), which could account for the apparent paucity of oskar genes in this group. However, we cannot rule out the possibility that oskar in the Polyneoptera may have diverged beyond our ability to detect it, or that it may have been lost multiple times, as observed for multiple holometabolous orders.
As well as multiple convergent losses of oskar, we also uncovered evidence for independent instances of duplication of the oskar locus. We defined a putative duplication instance as two or more oskar sequences (possessing both a LOTUS and OSK domain as per our definition) in the same species that shared <80% sequence similarity. All of these events were detected within the Hymenoptera. We therefore performed a phylogenetic analysis of the hymenopteran sequences to test the hypothesis that these were the result of duplication events ( fig. 4; supplementary fig. S4, Supplementary Material online). Our analysis of hymenopteran oskar sequences recovered previously published hymenopteran phylogenetic relationships (Peters et al. 2017). We found that oskar was duplicated in the four Figitidae species studied, a family of parasitoid wasps. Moreover, one out of ten examined Cynipidae species, as well as the only Ceraphronidae species examined, also harbored a duplicated oskar sequence. Multiple oskar duplications were also identified in the Chalcidoid wasps, notably in the Mymaridae (all three species studied), the Eupelmidae (two out of three species), the Aphelinidae (both species), and the Pteromalidae (one out of 17 species). Finally, we identified two additional apparently independent duplication events in the Aculeata, one in the wasp Polistes fuscatus [of 29 Vespidae, including three additional Polistes species, two with RefSeq genomes (P. canadensis and P. dominula) in which oskar was identified in single copy], and one in the red imported fire ant Solenopsis invicta (of 41 Formicidae species, including the congeneric S. fugax, with a GenBank genome in which oskar was identified in single copy).

Evidence for oskar Expression in Multiple Somatic Tissues
In studied insects to date, oskar is expressed and required in one or both of the germ line (Juhn and James 2006;Juhn et al. 2008;Lynch et al. 2011;Lehmann 2016) or the nervous system (Ewen-Campen et al. 2012;Xu et al. 2013). We asked whether these expression patterns could be detected in the insects studied here. To this end, we downloaded all available metadata for the transcriptomes analyzed here, to obtain information on the source tissues and developmental stages. We obtained these data for 371 out of the 1,123 transcriptomes in our analysis, including both holometabolous and hemimetabolous orders (see TSA Metadata Parsing and Curation in Materials and Methods). To first explore the distribution of oskar expression in the brain and the germ line, we binned the different tissues reported in the metadata into two categories, brain or germ line. This was done independently of the developmental stage (if that information was included in the metadata) by creating a mapping table and checking the extracted tissues against this table (supplementary table S3 at GitHub repository TableS3_germline_brain_table.csv, Supplementary Material online). We then cross referenced our homology detection with these metadata. We found evidence for oskar expression in the germ line of four orders (Phasmatodea, Hymenoptera, Coleoptera, and Diptera), and in the brain of five orders (Orthoptera, Blattodea, Hymenoptera, Coleoptera, Diptera) (see TSA Metadata Parsing and Curation in Materials and Methods for details on keyword extractions). For the vast majority of the data sets examined, transcriptomes were not generated with comparable methods for different organ systems from the same species, such that we cannot make strong statements about the relative expression levels of oskar in the reproductive and nervous systems. However, we did perform a limited assessment of this question using previously published transcriptomes from the mosquito Aedes aegypti (Diptera) (Matthews et al. 2016 In addition, we found evidence of oskar expression in several somatic tissues not previously implicated in studies of oskar expression and function. These tissues included the midgut (P. fuscatus, Sitophilus oryzae), fat body (P. fuscatus, Arachnocampa luminosa), salivary gland (Culex tarsalis, Anopheles aquasalis, Leptinotarsa decemlineata), venom gland (Culicoides sonorensis, Fopius arisanus), and silk gland (Bactrocera cucurbitae) In terms of developmental stage, we detected expression of oskar during embryonic, larval, or nymphal stages only in holometabolous insects, and for most hemimetabolous insects, oskar was detected in transcriptomes derived from adults ( fig. 3). However, it is important to note that for most species, transcriptomes were available only from adult tissues, rather than from a full range of developmental stages (supplementary fig.  S5, Supplementary Material online). We therefore cannot rule out the possibility that oskar expression at preadult stages is also a feature of multiple Hemimetabola. Indeed, we previously reported that oskar is expressed and required in the embryonic nervous system of a cricket, a hemimetabolous insect (Ewen-Campen et al. 2012

Was man nia auro pun ctata
Form icida e

Mo no mo riu m ph ara on is
For mic ida e

So len op sis inv ict a
Fo rm icid ae

S ol en op si s fu ga x
Fo rm ici da e

S o le n o p si s in vi ct a
Fo rm ici da e

Ve sp a cr ab ro
Ves pida e  (Lynch et al. 2011;Ewen-Campen et al. 2012), and within our alignment of Oskar sequences we could only detect the Long Oskar isoform within Diptera. Therefore, using our data set, we asked when these two isoforms had evolved. We selected the dipteran sequences from our Oskar alignment and then grouped the sequences by family. We plotted the amino acid occupancy at each alignment position (supplementary fig. S9, Supplementary Material online), and found that Long Oskar predates the Drosophilids, being identified as early as the Pinpunculidae (supplementary fig. S9, Supplementary Material online). Moreover, following the evolution of the Long Oskar isoform, the Long Oskar domain was retained in all families except for the Glossinidae and Scathophagidae. However, given that we identified only eight and two Oskar sequences for these families respectively, we cannot eliminate the possibility that apparent absence of the Long Oskar domain in these groups reflects our small sample size, rather than true evolutionary loss.

The LOTUS and OSK Domains Evolved Differently between Hemimetabolous and Holometabolous Insects
The fact that an oskar-dependent germ plasm mode of germ line specification mechanism has been identified only in holometabolous insects suggests that oskar may have been coopted in this clade for this function (Ewen-Campen et al. 2012). Under this hypothesis, evolution of the oskar sequence in the lineage leading to the Holometabola may have changed the physico-chemical properties of Oskar protein, such that it acquired germ plasm nucleation abilities in these insects. To test this hypothesis, we asked whether there were particular sequence features associated with Oskar proteins from holometabolous insects, in which Oskar can assemble germ plasm, and hemimetabolous insects, which appear to lack oskar-dependent germ plasm. In particular, we assessed the differential conservation of amino acids at particular positions across Oskar and asked if these might be predicted to change the physico-chemical properties of Oskar in specific ways that could potentially be relevant to germ plasm nucleation. We used the Valdar score (Valdar 2002) as the main conservation indicator for this study (see GitHub file scores.csv), as this metric accounts not only for transition probabilities, stereochemical properties and amino acid frequency gaps, but also for the availability of sequence diversity in the data set. It computes a weighted score, where sequences from less well-represented clades contribute proportionally more to the score than sequences from over-represented clades. Due to the highly unbalanced availability of genomic and transcriptomic data between hemimetabolous and holometabolous sequences (supplementary table S1, Supplementary Material online; fig. 3) the choice of a weighted score was necessary to avoid biasing the results toward insect orders such as Diptera or Hymenoptera. To study the difference between hemimetabolous and holometabolous sequences, we did not use the Valdar score directly, but instead computed the conservation ratio between both groups for each position, which we call the conservation bias (see Computation of the Conservation Bias in Materials and Methods). We plotted the conservation bias on the solved 3D crystal structure of the D. melanogaster LOTUS and OSK domains (Jeske et al. 2015;Yang et al. 2015) to ask whether specific functionally relevant structures showed phylogenetic or other patterns of residue conservation ( fig. 5). First, we asked if the conservation score at the scale of domains was different between holometabolous and hemimetabolous sequences. We observed that the conservation bias for the LOTUS domain was centered around a mean of 1.00, indicating that both Holometabola and Hemimetabola displayed a similar conservation of the LOTUS domain ( fig.  5a). For the OSK domain, however, the conservation bias was centered around 0.84, indicating that the hemimetabolous sequences displayed a higher level of conservation compared with holometabolous sequences ( fig. 5a). To interrogate specific biochemical hypotheses, we then examined the degree of conservation bias in different regions of the protein structure. We asked if the amino acids of the b sheets of the LOTUS domain thought to be involved in dimerization of the protein (Jeske et al. 2015;Yang et al. 2015) displayed conservation bias. Both b sheets had an overall even bias (mean: 1.03 and 1.05 for b1 and b2, respectively) between both groups ( fig. 5b). Second, as we had observed that hemimetabolous OSK was more conserved overall than holometabolous OSK, we asked if there were any clear patterns of conservation bias in specific regions of the OSK domain ( fig. 5a and b). We found that some of the secondary structures within the OSK domain showed a differential conservation (a2: 0.54, a6: 0.42, b2: 0.52), whereas other structures were within <0.1 of the median value for OSK. Moreover, we observed a large pocket of amino acids showing a conservation bias toward hemimetabolous sequences located on the surface of OSK ( fig. 5c). This particular area contains the previously reported important amino acids for the RNA binding function of OSK (Jeske et al. 2015;Yang et al. 2015) namely, R442, R436, and R576. The electrostatic properties at those positions were conserved in the holometabolous sequences R436: 0.36, R442: 0.29 and R576: 0.81 ( fig. 5d), but not in hemimetabolous sequences. In other words, these specific amino acid residues are outliers in that they are more specifically conserved in holometabolous OSK sequences, but are located within a domain that overall is more conserved in Hemimetabola.
To gain further insight into the differences in conservation across insects, we reduced the MSA dimensionality using a multiple correspondence analysis (MCA), an equivalent of PCA for categorical variables (Lebart et al. 1984). We performed the dimensionality reduction for the full-length Oskar sequence alignment as well as for the LOTUS and OSK alignments (supplementary fig. S10, Supplementary Material online). Interestingly, we found that most of the variance in sequence space was due to dipterans and hymenopterans (supplementary fig. S10

Evidence for Evolution of Stronger Dimerization Potential of the Oskar LOTUS Domain in Holometabola
The LOTUS domain dimerizes in vitro through electrostatic and hydrophobic contacts of Arg215 of the b2 sheet and Thr195, Asp197, and Leu200 of the a2 helix (Jeske et al. 2015;Yang et al. 2015). To date, however, the biological significance of Oskar dimerization remains unknown. Moreover, the dimerization of the LOTUS domain does not appear to be conserved across all Oskar sequences (Jeske et al. 2015). Specifically, ten LOTUS domains from nondrosophilid species were tested for dimerization, and only LOTUS domains from Drosophilidae, Tephritidae, and Pteromalidae formed homodimers (Jeske et al. 2015). The other sequences tested, from Culicidae, Formicidae, and Gryllidae, remained monomeric under the tested conditions (Jeske et al. 2015). We selected the LOTUS sequences in our alignment from those six families and placed them into one of two groups, dimeric and monomeric LOTUS, under the assumption that any sequence from that family would conserve the dimerization (or absence thereof) properties previously reported (Jeske et al. 2015). We asked whether we could detect any evolutionary changes between the two groups in properties of known important dimerization interfaces and residues in our sequence alignment (Jeske et al. 2015).
In the D. melanogaster structure, two key amino acids, D197 and R215, are predicted to form hydrogen bonds that stabilize the dimer (Jeske et al. 2015). We found that in the dimer group, the electrostatic properties of these two amino acids are highly conserved (À0.75 for D197 and 0.81 for R215), whereas in the monomer group the electrostatic interaction is not conserved (0.03 for D197 and À0.11 for R215) ( fig. 6e). Given the differential conservation between the two groups, our results support the previous finding that disrupting this interaction prevents dimerization (Jeske et al. 2015). L200 was previously hypothesized to stabilize the interface via hydrophobic forces (Jeske et al. 2015). We observed that the hydrophobicity of this residue is highly conserved in the dimer group (L200: 0.89), but that in the monomer group this residue is hydrophilic (L200: 2.33) ( fig. 6f). In sum, our analyses show that key amino acids in the LOTUS domain evolved differently in distinct insect lineages, in a way that may explain why some insect LOTUS domains dimerize and some do not.

Conservation of the OskarÀVasa Interaction Interface
Next, we asked whether we could detect differential conservation of regions or residues within the LOTUSÀVasa interface. It was previously reported that the LOTUS domain of Oskar acts as an interaction domain with Vasa (Jeske et al. 2017), a key protein with a conserved role in the establishment of the animal germ line (Hay et al. 1990;Lasko 2013). The interaction between D. melanogaster Oskar's LOTUS domain and Vasa is through an interaction surface situated in the pocket formed by the helices a2 and a5 of the LOTUS domain ( fig. 6aÀc). Due to the essential role that vasa plays in germ line determination (reviewed in Raz [2000]; Noce et al. [2001]; Extavour and Akam [2003]; Ewen-Campen et al. [2010]; Lasko [2013]), and the potential co-option of oskar to the germ line determination mechanism in Holometabola (Ewen-Campen et al. 2012), we hypothesized that evolutionary conservation of the residues of this interface might be detectable. First, we observed that the residues of the LOTUS domain a2 and a5 helices, which directly contact Vasa (Jeske et al. 2017) were highly conserved overall (a2 average Valdar score 0.49; a5 Valdar score 0.56) ( fig. 6b). Specifically, we observed that the previously in vitro-confirmed Vasa interacting amino acids A162 and L228 of the LOTUS domain were highly conserved (Valdar score: 0.64 for both residues) (Jeske et al. 2017). We also noted that Q235 and H227 of the LOTUS domain a5 helix are also highly conserved, suggesting them as putative novel important interaction partners (Valdar score: 0.90 and 0.90 for both residues) ( fig. 6b). Moreover, facing the LOTUS domain H227 is Vasa M540, which may act as a proton donor to form a hydrogen bond between the histidine ring and the sulfur atom of the methionine (Pal and Chakrabarti 2001) ( fig. 6b and b'). The LOTUS domain a2 helix is overall slightly less conserved than the LOTUS domain a5 helix (Valdar score: 0.49 vs. 0.56) ( fig. 6a, b'', and c''), but hydrophobic properties are conserved on one side of the a2 helix ( fig. 6c and c') forming a motif of conserved amino acid properties ( fig. 6c'').
Previous reports have hypothesized that the D. melanogaster LOTUS domain could act as a dsRNA binding domain (Anantharaman et al. 2010;Callebaut and Mornon 2010). However, in D. melanogaster, it was later reported that the LOTUS domain did not bind to nucleotides (Jeske et al. 2015). Therefore, using our data set we assessed the potential RNA binding properties of LOTUS domains to test the conservation of this prediction. We used the RNABindR algorithm (Terribilini et al. 2007) to predict potential RNA binding sites of the LOTUS domain, and computed a conservation score for each position (Terribilini et al. 2007). We found that the a5 helix is the location in the LOTUS domain that has the most conserved prediction for RNA binding (fig. 6d). We therefore suggest that the possibility that LOTUS binds RNA directly warrants further experimental examination.
Finally, we asked whether the secondary structure of the LOTUS domain might be conserved. Secondary structures are often indicative of the tertiary structure of a domain. Therefore, we reasoned that the secondary structure might be conserved even if the sequence varies. We submitted the LOTUS sequences from all identified Oskar homologs to the Jpred4 servers (Drozdetskiy et al. 2015) for secondary structure prediction and mapped the results onto the Oskar alignment we obtained. We found that the secondary structure of LOTUS is highly conserved throughout Oskar homologs, with the exception of the a1 helix ( We asked whether the OSK domain showed any differential conservation across the different parts of the domain. We found that the OSK domain of Oskar showed an overall conservation across all insects, similar to the LOTUS domain (Valdar score: 0.51) ( fig. 7a). However, the conservation pattern is higher in the core amino acids (Valdar score average of core amino acid: 0.54) when compared with the residues at the surface (Valdar score average for surface amino acid: 0.23) ( fig. 7a). Despite the overall low conservation of the residues at the surface of the OSK domain, we found that the electrostatic properties are conserved overall (electrostatic conservation score > 0; conserved) in the previously reported putative RNA binding pocket (Yang et al. 2015). However, as previously mentioned, this conservation is stronger in holometabolous sequences (fig. 5d). These results are in accordance with the potential role of OSK as an RNA Binding domain in the context of germ plasm assembly (Jeske et al. 2015;Yang et al. 2015). We also submitted the OSK sequences to the same secondary structure analysis performed on LOTUS. We found that, as for the LOTUS domain, the secondary structure of OSK is highly conserved throughout all insect sequences analyzed (supplementary fig. S11, Supplementary Material online).
We then asked if the conservation patterns observed at the core of OSK were clustered in sequence motifs. When we looked at the location of the highly conserved amino acids, we found that the conservation was driven by four welldefined sequence motifs ( fig. 7c, c', c'', and c'''). Given that oskar plays different roles in Holometabola and Hemimetabola, we asked whether the conserved OSK motifs showed any difference in conservation between these two groups. Of the four highly conserved OSK core motifs (

An Expanded Collection of oskar Homologs
oskar provides a powerful case study of functional evolution of a gene with an unusual genesis (Blondel et al. 2020). Here, we gathered the most extensive set of homologous oskar sequences to date. However, most insect genomic and transcriptomic data have been generated from only a few orders, and the vast majority from the Holometabola. Diptera, Lepidoptera, Coleoptera, Hymenoptera, and Hemiptera represent 82% of the data sets available at the time of this analysis. We emphasize that expanded taxon sampling, particularly for the Hemimetabola, will be critical for further studies of the evolution of protein function across insects. Moreover, only a small proportion (27% for tissue type, 26% for organism stage, and 14% for sex) of the TSA data sets contained usable metadata regarding the stage and tissue type sampled. Standardization of the nature and format of transcriptomic metadata would also be a worthwhile endeavor that could increase the efficiency and efficacy of future work.

Convergent Losses and Duplications of oskar in Insect Evolution
A previous report suggested that oskar had been lost from the genome of the silk moth B. mori (Lynch et al. 2011). Our analysis of 232 data sets across 44 of the 126 described lepidopteran families (Kawahara et al. 2019) strongly suggests that the loss of oskar in the Lepidoptera (butterflies and moths) is not unique to the silk moth, but rather occurred early and repeatedly in lepidopteran evolution. The fact that oskar is a component of the oosome at the posterior of the oocyte (the wasp germ plasm analog; Quan et al. 2019) and required for germ cell formation in the wasp Nasonia vitripennis (Lynch et al. 2011) implies that a common ancestor of Holometabola had already established an oskar-dependent inheritance mode of germ line specification. Therefore, the apparent subsequent loss in nearly all Lepidoptera examined of a gene responsible for the establishment of the germ plasm in other Holometabola might seem unexpected. Few studies have directly addressed the molecular mechanisms of germ cell specification in Lepidoptera. In B. mori (Bombicidae), vasa mRNA (Nakao 1999), and protein (Nakao et al. 2006), and the transcripts of one of four nanos homologs (nanos-O) (Nakao et al. 2008), have been detected in a region of ventral cortical cytoplasm in preblastoderm stage embryos. As putative primordial germ cells form in this location at later stages (Miya 1958), some authors have speculated that a germ plasm, located ventrally rather than posteriorly, may specify germ cells in this moth (Toshiki et al. 2000;Nakao et al. 2008). However, recent knockdown experiments showed that maternal nanos-O is dispensable for germ cell formation (Nakao and Takasu 2019), consistent with a zygotic, inductive mechanism. In the butterfly Pararge aegeria (Nymphalidae), no oskar homolog has been identified in the genome (Carter et al. 2013), but the transcripts of one of four identified nanos homologs (nanos-O) have been detected in a small region of ventral cortical ooplasm, again prompting speculation that this lepidopteran may also deploy a germ plasm (Carter et al. 2015). We suggest that if these or other Lepidoptera do indeed rely on germ plasm to specify their germ line, they may do so using a germ plasm nucleator other than Oskar. For most studied Lepidoptera, however, classical embryological studies report the first appearance of primordial germ cells at postblastoderm stages, either from the ventral midline of the cellular blastoderm or early germ band (Woodworth 1889;Tomaya 1902;Sehl 1931;Miya 1953Miya , 1958Miya , 1975Tanaka 1987), from the celomic sac mesoderm  (Lehmann and Nüsslein-Volhard 1986;Kim-Ha et al. 1991); S457F ¼ osk[6B10] (Breitwieser et al. 1996)) or to reduced RNA-binding affinity of the OSK domain (R436E; Yang et al. 2015).
Evolution of a Cytoplasmic Determinant . doi:10.1093/molbev/msab284 MBE of the abdomen (Johannsen 1929;Eastham 1930;Saito 1937;Presser and Rutschky 1957;Kobayashi and Ando 1984), or from the primary ectoderm of the caudal germ band (Schwangart 1905;Lautenschlager 1932;Ando and Tanaka 1980;Tanaka 1987;Guelin 1994) (supplementary fig. S3, Supplementary Material online). Taken together, these data suggest that an inductive mechanism may operate to specify germ cells in most moths and butterflies. We speculate that the loss of oskar from most lepidopteran genomes may have facilitated or necessitated secondary reversion to the hypothesized ancestral inductive mechanism for germ line specification.
Another order with apparent near-total absence of oskar homologs is the Hemiptera (true bugs), whose sister group Thysanoptera (thrips) nevertheless possesses oskar. This secondary loss of oskar from a last common hemipteran ancestor correlates with the reported postblastoderm appearance of primordial germ cells in the embryo. Classical studies on most hemipteran species describe germ cell formation as occurring after cellular blastoderm formation, on the inner (yolk-facing) side of the posterior blastoderm surface (Metschnikoff 1866;Witlaczil 1884;Will 1888;Mellanby 1935;Butt 1949;Kelly and Huebner 1989;Heming and Huebner 1994). A notable exception to this is the parthenogenetic pea aphid Acyrthosiphon pisum, for which strong gene expression and morphological evidence supports a germ plasm-driven germ cell specification mechanism in both sexual and asexual modes (Miura et al. 2003;Chang et al. 2006;Lin et al. 2014). In contrast, studies of the aphids Aphis plantoides, A. rosea, and A. pelargonii describe no germ plasm, and postblastoderm germ cell formation (Metschnikoff 1866;Witlaczil 1884;Will 1888). However, the genomes of all aphids studied here, including A. pisum and three Aphis species, appear to lack oskar. This suggests that germ plasm assembly in A. pisum either does not require a nucleator molecule or uses a novel non-Oskar nucleator.
In the Hymenoptera (ants, bees, wasps, and sawflies), our results strongly suggest that oskar was lost from the genome of the last common ancestor of bees and spheroid wasps (supplementary fig. S12, Supplementary Material online). Our analysis further suggests multiple additional independent losses in as many as 25 other hymenopteran lineages, including some for which good quality RefSeq genomes were available (e.g., the slender twig ant Pseudomyrmex gracilis or the wheat stem sawfly Cephus cinctus) (supplementary fig. S12, Supplementary Material online). However, it would be premature to draw strong conclusions about the number of independent losses given the predominance of transcriptome data in the Hymenoptera.
In addition to convergent losses of oskar, we also found evidence for clade-specific duplications of oskar in the Hymenoptera. Seven of the nine families containing these putative duplications are families of parasitoid wasps; the remaining two families are ants (Formicidae) and the group of yellowjackets, hornets, and paper wasps (Vespidae) (fig. 4). The phylogenetic relationships of these groups make it highly unlikely that a duplication occurred only once in their last common ancestor, which would be the last common ancestor of all wasps, bees, and ants (i.e., Apocrita, all hymenopterans except sawflies) (supplementary fig. S12, Supplementary Material online). We suggest that the most parsimonious hypothesis is one of three to five independent duplications of oskar, followed by at least 9À14 independent reversions to a single copy, or total loss of the locus (supplementary fig. S9, Supplementary Material online).
No notable life history characteristics appear to unite those species with multiple oskar homologs: They include eusocial and solitary, sting-bearing and stingless, parasitoid and nonparasitic insects. To our knowledge, neither is there anything unique about the germ line specification process in Hymenoptera with one or more than one oskar homolog. Most Hymenoptera appear to use a germ plasm-driven mechanism to specify germ cells in early blastoderm stage embryos (supplementary fig. S12 and references therein, Supplementary Material online), and we identified oskar homologs for all such species described in the embryological literature (supplementary fig. S12, Supplementary Material online). In the notable example of the honeybee A. mellifera, in which cytological and molecular evidence suggests germ cell arise from abdominal mesoderm (Bütschli 1870;Nelson 1915;Sander 1985, 1986;Zissler 1992;Gutzeit et al. 1993;Dearden 2006), we identified no oskar homolog in its well-annotated genome (supplementary fig. S12, Supplementary Material online), as noted previously by other authors (Lynch et al. 2011). However, no major differences in germ plasm or pole cell formation have been reported in species or families of ants or wasps with duplicated oskar loci, compared with close relatives that possess oskar in single copy [e.g., compare the ants Solenopsis invicta (at least two oskars) and Aphaenogaster rudis (one oskar) (Khila and Abouheif 2008), or the pteromalid wasps Nasonia vitripennis (one oskar) (Lynch and Desplan 2010;Lynch et al. 2011;Quan et al. 2019) and Otitesella tsamvi (two oskars)]. Thus, future studies that independently abrogate the functions of each paralog individually, will be needed to determine the biological significance, if any, of these oskar duplications.

Evolution of the Long Oskar Domain
We showed that the Long Oskar domain is an evolutionary novelty confined to a subset of Diptera. This raises the question of whether the evolution of this domain led to any novel functional properties of oskar in these Diptera, relative to its functions in other insects. The only data available on the specific functions of the Long Oskar domain are from studies on D. melanogaster. The Long Oskar (606 amino acids: possessing the Long Oskar domain) and Short Oskar (468 amino acids: lacking the Long Oskar domain) isoforms are generated by translation of oskar mRNA from alternate initiation codons within the same transcript (Markussen et al. 1995). Short Oskar alone cannot maintain oskar mRNA or either protein isoform at the posterior pole of the oocyte or embryo (Vanzo and Ephrussi 2002). However, Short Oskar alone is able to promote the formation of pole cells, albeit many fewer than wild type (Markussen et al. 1995). In contrast, Long Oskar alone can anchor oskar mRNA, Oskar protein, and mitochondria at Blondel et al. . doi:10.1093/molbev/msab284 MBE the posterior pole, but cannot promote pole cell formation (Rongo et al. 1997;Vanzo and Ephrussi 2002;Hurd et al. 2016). In vitro, Short Oskar has a higher affinity for germ plasm components than Long Oskar (Breitwieser et al. 1996;Babu et al. 2004;Anne and Mechler 2005;Megosh et al. 2006;Suyama et al. 2009;Anne 2010). Furthermore, Short Oskar associates with the cytoplasmic germ granules themselves, whereas Long Oskar instead associates with endosomal membranes (Vanzo et al. 2007). These observations have led to the model that Long Oskar's main role is to recruit and anchor Short Oskar to the posterior, where Short Oskar is responsible for germ plasm assembly per se (Markussen et al. 1995;Vanzo and Ephrussi 2002;Tanaka and Nakamura 2008;Tanaka et al. 2011;Hurd et al. 2016).
The molecular basis for the apparently distinct roles of these two isoforms remains largely unclear, and is unlikely to reside entirely within the Long Oskar domain. In vivo assessments of the 139-amino acid Long Oskar domain alone show that it is necessary and sufficient to maintain mitochondria at the oocyte cortex (Hurd et al. 2016). This Long Oskar domain-mediated mitochondrial maintenance requires an intact F-actin cortical cytoskeleton, which is modified by the presence of the Long Oskar domain (Tanaka and Nakamura 2008;Tanaka et al. 2011;Hurd et al. 2016). Compared with controls, long oskar null mutant flies (possessing only Short Oskar) generate fewer PGCs with fewer mitochondria, and their ovaries lack germ cells more often than controls (Hurd et al. 2016).
Although the Long Oskar isoform thus appears to play important and unique roles in functional germ plasm assembly in D. melanogaster, these roles appear to be performed perfectly well by the single isoform possessed by nearly all other insects, which in terms of sequence is essentially equivalent to Short Oskar. One or more of posterior oskar and germ plasm localization, posterior pole cell formation, and mitochondrial enrichment within germ plasm have been reported for species of ants, bees, wasps, beetles, mosquitoes, and flies that all lack a Long Oskar isoform (Nardon 1971;Jaglarz et al. 2003;Goltsev et al. 2004;Zhurov et al. 2004;Juhn and James 2006;Nardon 2006;Juhn et al. 2008;Lynch et al. 2011;Yoon et al. 2019;Rafiqi et al. 2020). We note, however, that many of these species are reported to possess an oosome, which is a single, morphologically distinct discrete nonmembrane-bound organelle that houses germ plasm components (Meng 1968;Nardon 1971;Klag and Bilinski 1993;Jaglarz et al. 2003;Zhurov et al. 2004;Nardon 2006;Lynch et al. 2011;Quan et al. 2019). This is distinct from most Drosophila species for which data are available, whose germ plasm is in the form of multiple smaller granules loosely clustered near the posterior cortex (Mahowald 1962(Mahowald , 1968Mahowald et al. 1976). We therefore speculate that the evolution of the Long Oskar domain may have enabled tight cortical anchoring of germ plasm components via interaction with endosomes and/or the F-actin cytoskeleton, eliminating the need for an oosome to ensure integrity or local concentration of germ plasm.

Reexamination of Potential Interactions between the LOTUS Domain and RNA
Proteins with a LOTUS domain commonly participate in nucleic acid binding (Williams et al. 1993;Gajiwala and Burley 2000;Liu et al. 2001;Aravind et al. 2005;Lachke et al. 2011;Cui et al. 2013;Harami et al. 2013;Mukherjee et al. 2014). LOTUS domain-containing proteins, particularly RNA-binding proteins (Cui et al. 2013), are often enriched in germ plasm (Anantharaman et al. 2010;Callebaut and Mornon 2010), as are specific RNAs (Ephrussi et al. 1991;Wang and Lehmann 1991;Jongens et al. 1992;Smith et al. 1992;Kobayashi et al. 1995;Nakamura et al. 1996;Mahowald 2001;Vanzo and Ephrussi 2002;Ewen-Campen et al. 2010). However, to date there is no direct evidence that Oskar's LOTUS domain interacts directly with RNA. We were therefore intrigued to find that our bioinformatic analysis suggested that the LOTUS helix a5 might have binding RNA ability ( fig. 6d). Consistent with the possibility that Oskar's LOTUS domain might somehow interact with RNA in vivo, we have observed that a loss of function oskar allele lacking the entire LOTUS domain (oskar[DLOTUS]), is unable to direct accumulation of Nanos protein in the germ plasm (Extavour lab, unpublished observation). If the OSK domain, which unlike the LOTUS domain, binds nanos mRNA in vitro (Jeske et al. 2015;Yang et al. 2015), were sufficient to ensure Nanos protein localization via nanos mRNA recruitment, then germ plasm in oskar[DLOTUS] flies should contain Nanos protein. Our opposite result could indicate that LOTUS plays a role in RNA binding and/or local translation of nanos mRNA. In principle, this could be indirect, for example, aided by LOTUS-mediated oligomerization (Jeske et al. 2015;Yang et al. 2015), or it could be via direct LOTUSÀRNA contacts that have not yet been detected in biochemical studies. Further, we note that LOTUSÀRNA interactions have, to our knowledge, been probed biochemically and genetically only in D. melanogaster, which does not rule out the existence of such binding interactions in other insects.

Functional Implications of Differential Conservation of Regions of the LOTUS and OSK Domains
We have identified novel conserved amino acid positions that we hypothesize are important for the Vasa binding properties of the LOTUS domain and the RNA properties binding of the OSK domain (figs. 6 and 7). Our observation of the conservation of the LOTUS domain a2 helix is consistent with its previously reported importance in LOTUSÀVasa binding (Jeske et al. 2017). In the a2 helix, we also observed high conservation of H227 and Q235. The positions of these residues suggest they may contribute to the interaction between Vasa and LOTUS, but they have not, to our knowledge, yet been implicated functionally in vitro or in vivo. We suggest they should therefore be the target of future mutational studies. Moreover, evolution at the interface between two proteins involves amino acids on both sides of the surface. Therefore, further studies looking at potential coevolution between Oskar and Vasa could shed light on whether the Evolution of a Cytoplasmic Determinant . doi:10.1093/molbev/msab284 MBE conserved amino acids that we identified in the LOTUS domain interact with similarly conserved Vasa residues, or whether evolutionary variations in OskarÀVasa interactions may be explained by coevolution of specific residues at their interaction surfaces (Andreani et al. 2020).
We also uncovered an interesting new conservation pattern within the OSK domain. The conserved amino acids were more abundant in the core of the domain than on the surface. This differential conservation might be relevant to the acquisition of a germ plasm nucleator role of oskar in the Holometabla (fig. 5). We noted that the basic properties of surface residues previously reported for D. melanogaster (Yang et al. 2015) are conserved across insects, which might indicate that the RNA binding properties of OSK observed in D. melanogaster (Jeske et al. 2015;Yang et al. 2015) are also conserved throughout holometabolous insects. We speculate that the comparatively low amino acid conservation of the surface residues in Holometabolous OSK domains, which nevertheless display highly conserved basic properties, could have allowed greater flexibility in the coevolution of specific RNA binding partners for the OSK domains of different lineages.

OSK Evolved Differentially between Holometabolous and Hemimetabolous Insects
Finally, we observed a differential conservation of the OSK domain between hemimetabolous and holometabolous insects. Specifically, we found that the OSK sequence was less conserved across the Holometabola than across the Hemimetabola. This observation raises two potential hypotheses regarding the role of the OSK domain in the functional evolution of Oskar. First, perhaps the apparently relaxed purifying selection experienced by OSK in the Holometabola was necessary for the co-option of oskar to a germ plasm nucleation role. Second, Oskar might have a function in the hemimetabolous insects that requires strong conservation of OSK. More studies on the roles and biochemical properties of OSK in hemimetabolous insects will be required to test these hypotheses and further our understanding of the biological relevance of this differential conservation.
In conclusion, analysis of the large data set of novel Oskar sequences presented here provides multiple new testable hypotheses concerning the molecular mechanisms and functional evolution of oskar, that will inform future studies on the contribution of this unusual gene to the evolution of animal germ cell specification.

Lead Contact and Materials Availability
This study did not generate new unique reagents. This study generated new python3 code and supplementary files referred to below, all of which are available at https://github. com/extavourlab/Oskar_Evolution. Requests for further information and requests for resources and reagents should be directed to and will be fulfilled by Cassandra G. Extavour (extavour@oeb.harvard.edu).

Experimental Model and Subject Details
This study used no cell culture lines. This study used live samples of D. melanogaster and C. maculatus and ethanolpreserved samples of A. asperrimus. The study also used previously generated genomic and transcriptomic data sets. All the information regarding how those data sets were generated can be found on their respective NCBI pages. The list of all the data sets used in this study can be found in the following files: genome_insect_database.csv, transcriptome_in-sect_database.csv, genome_crustacean_database.csv, and transcriptome_crustacean_database.csv.

Genome and Transcriptome Preprocessing
We collected all available genome and transcriptome data sets from the NCBI repository registered in September 2019 ( fig. 2). NCBI maintains two tiers of genomic data: RefSeq, which contains curated and annotated genomes, and GenBank, which contains nonannotated assembled genomic sequences. Transcriptomes are stored in the transcriptome shotgun assembly (TSA) database, with metadata including details on their origin. Among the registered data sets, five genomes were not yet available, and 40 transcriptomes were only available in the NCBI Trace repository. As they did not comply with the TSA database standards, they were excluded from the analysis. To search for oskar homologs in data sets retrieved from GenBank, we needed to generate in silico gene model predictions. We used the genome annotation tool Augustus (Stanke et al. 2006), which requires a hidden Markov model (HMM) gene model. To use HMMs producing gene models that would be as accurate as possible for nonannotated genomes, we selected the most closely related species (species with the most recent last common ancestor) that possessed an annotated RefSeq genome. We then used the Augustus training tool to build an HMM gene model for each genome.
We automated this process by creating a series of python scripts that performed the following tasks: (1) 1.1_insect_database_builder.py: This script collects the NCBI metadata regarding genomes and transcriptomes. Using the NCBI Entrez API, it collects the most up to date information on RefSeq, GenBank, and TSA to generate two CSV files: genome_insect_database.csv and transcriptome_insect_database.csv.
(2) 1.2_data_downloader.py: This is a python wrapper around the rsync tool that downloads the sequence data sets present in the tables created by (1). It automatically downloads all the available information into a local folder.
(3) 1.3_run_augustus_training.py: This is a python wrapper around the Augustus training tool. It uses the metadata gathered using (1) and the sequence information gathered using (2) to build HMM gene models of all RefSeq Blondel et al. . doi:10.1093/molbev/msab284 MBE data sets. It outputs sbatch scripts that can be run either locally, or on a SLURM-managed cluster. Those scripts will create unique HMM gene models per species.
At the time of this analysis (September 2019), 133 insect genomes were collected from the RefSeq database, 309 genomes from the GenBank database, and 1,123 transcriptomes from the TSA database. All the accession numbers and metadata are available in the two tables (genome_insect_database.csv and transcriptome_insect_database.csv) provided in the supplementary files. This pipeline was repeated for crustaceans and the information can be found in the following two files: genome_crustacean_database.csv and transcriptome_crustacean_database.csv.

Creation of Protein Sequence Databases
The classical approach for homology detection compares protein sequences to amino acid HMM corresponding to the gene of interest. Since we used three different NCBI databases, we performed the following preprocessing actions: (1) RefSeq: Well-annotated genomes from NCBI contain gene model translation; no extra processing was required.
(2) GenBank: Using the HMMs created from the RefSeq databases, we created gene models for each GenBank genome using Augustus and a custom HMM gene model. To choose which HMM gene model to use, we selected the one for each insect order that had the highest training accuracy. In the case where an insect order did not have any member in the RefSeq database, we used the model of the most closely related order. We then translated the inferred coding sequences to create a protein database for each genome. The assignment of the models used to infer the proteins of each GenBank genome is available in the Table_S4_models.csv, Supplementary Material online, available through the GitHub repository for this study at https://github.com/extavourlab/Oskar_Evolution. To automate the process, we created a custom python script available in the file 1.4_run_augustus.py. (3) TSA: Transcriptomes were translated using the emboss tool Transeq (Madeira et al. 2019). We used this tool with the default parameters, except for the six-frame translation, trim and clean flags. This generated amino acid sequences for each transcript and each potential reading frame.

Identification of Oskar Homologs
The oskar gene is composed of two conserved domains, LOTUS and OSK, separated by a highly variable interdomain linker sequence (Ahuja and Extavour 2014;Jeske et al. 2015;Yang et al. 2015). To our knowledge, no other gene reported in any domain of life possesses this domain composition (Blondel et al. 2020). Therefore, here we use the same definition of oskar homology as in our previous work: a sequence possessing a LOTUS domain followed by an interdomain region, and then an OSK domain (Blondel et al. 2020). To maximize the number of potential homologs, we searched each sequence with the previously generated HMM for the LOTUS and OSK domains (Blondel et al. 2020). The presence and order of each domain were then verified for each potential hit and only sequences with the previously defined Oskar structure were kept for further processing. We used the HMMER 3.1 tool suite to build the domain HMM (hmmbuild with default parameters), and then searched the generated protein databases (see Creation of Protein Sequence Databases) using those models (hmmsearch with default parameters). Hits with an E-value ! 0.05 were discarded. A summary of all searches performed is compiled in Table_S5_searches.csv Supplementary Material online, in the GitHub repository for this study at https://github.com/extavourlab/Oskar_ Evolution. All the hits were then aligned with hmmalign with default parameters and the HMM of the full-length Oskar alignment previously generated (Blondel et al. 2020). The resulting sequences were automatically processed to remove assembly artifacts, and potential isoforms. This filtration step was automated and went as follows: First, the sequences were grouped by taxon. Then each group of sequences was aligned using MUSCLE (Edgar 2004) with default parameters. The Hamming distance (Hamming 1950), a metric that computes the number of different letters between two strings, between each sequence in the alignment, was computed. If any group of sequences had a Hamming distance of > 80%, then we only kept the sequence with the lowest E-value match. This created a set of sequences containing multiple oskar homologs per species only if they were the likely product of a gene duplication event. We then used the resulting new alignment to generate a new domain HMM and a new full-length Oskar HMM (using hmmbuild with default parameters) and ran further iterations of this detection pipeline until we could detect no new oskar homologs in the available sequence data sets. We called this final set the filtered set of sequences and used it in all subsequent homology analyses unless otherwise specified.
The Oskar sequences obtained are available in the following supplementary files: Oskar_filtered.aligned.fasta, Oskar_filtered.fasta, and Oskar_consensus.hmm.

Correlative Analysis of Assembly Quality and Absence of Oskar
Using the metadata gathered previously from NCBI databases (see Genomes and Transcriptomes Preprocessing) we created two pools of source data: genomes where we identified an oskar sequence, and genomes where we failed to find a sequence that met our homology criteria. We then compared the two distributions for each of the eight available assembly statistics: 1) Contig and 2) Scaffold N50, 3) Contig and 4) Scaffold L50, 5) Contig and 6) Scaffold counts, and 7) Number of Contigs and 8) Scaffolds per genome length.
Evolution of a Cytoplasmic Determinant . doi:10.1093/molbev/msab284 MBE Finally, we performed a MannÀWhitney U statistical analysis to compare the means of the two distributions (see 2.1_Oskar_discovery_quality.ipynb).

TSA Metadata Parsing and Curation
Data sets in the TSA database are associated with a biosample object that contains all the metadata surrounding the RNA sequencing acquisitions. These metadata can include information about one or both the tissue of origin and the organism's developmental stage. We first automated the retrieval of these metadata using a custom python script that used the NCBI Entrez API (see 2.3_Oskar_tissues_stages.ipynb). However, the metadata proved to be complex to parse for the following reasons: 1) not all projects had the data entered in the corresponding tag, 2) some data contained typographical errors, and 3) multiple synonyms were used to describe the same thing with different words in different data sets. We therefore created a custom parsing and cleaning pipeline that corrected mistakes and aggregated them into a cohesive set of unique terms that we thought would be most informative to interpret the presence or absence of oskar homologs (see 2.3_Oskar_tissues_stages.ipynb to see the mapping table). This strategy sacrificed some of the fine-grained information contained in custom metadata (e.g., "right leg" became "leg") but allowed us to analyze the expression of oskar using consistent criteria throughout all the data sets. This pipeline generated, for all available data sets, a table of tissues and developmental stages including oskar presence or absence in the data set (see Oskar_all_tissues_stages.csv).

Dimensionality Reduction of Oskar Alignment Sequence Space
The Oskar alignment was subjected to an MCA. Similar to a PCA, dimension vectors were first computed to maximize the spread of the underlying data in the new dimensions, except that instead of a continuous data set, each variable (here an amino acid at a given position) contributes to the continuous value on that dimension. Once the projection vectors were computed, each sequence was then mapped onto the dimensions. Each amino acid position (column) in the alignment was considered a dimension with a possible value set of 21 (20 amino acids and gap). We first removed the columns of low information (columns that had <30% amino acid occupancy) using trimal (Capella-Gutierrez et al. 2009) with a cutoff parameter set at 0.3. Then, the alignment was decomposed into its eigenvectors, and projected to the first three components. To perform this decomposition, we implemented a previously developed preprocessing method (Rausell et al. 2010) in a python script (see MCA.py and 2.8_Oskar_MCA_Analysis.ipynb) and performed the eigenvector decomposition with the previously developed MCA python library (see Key Resource Table). We ran the same algorithm on the LOTUS domain, OSK domain, and fulllength Oskar alignments obtained above (see Identification of oskar Homologs).

Phylogenetic Inference of Oskar Sequences in the Hymenoptera
We aligned all hymenopteran Oskar sequences using PRANK (Loytynoja 2014) with default parameters. We then manually annotated duplicated sequences by considering two sequences from the same species that had < 80% amino acid identity, as within-species duplications of oskar. We trimmed this alignment to remove all columns with < 50% occupancy using trimal with the cutoff parameter set at 0.5. To reconstruct the phylogeny of these sequences, we used the maximum likelihood inference software RAxML (Stamatakis 2014) with a gamma-distributed protein model, and activated the flag for auto model selection. We ran 100 bootstraps and then visualized and annotated the obtained tree with Ete3 (Huerta-Cepas et al. 2016) in a custom ipython notebook (see 2.7_Oskar_duplication.ipynb).

Calculation of Oskar Conservation Scores
Using the large set of homologous Oskar sequences obtained as described above, we computed different conservation scores for each amino acid position. This methodology relies on the hypothesis that if an amino acid, or its associated chemical properties at a particular position in the sequence are important for the structure and/or function of the protein, they will be conserved across evolution. We considered multiple conservation metrics, each highlighting a particular aspect of the protein's properties as described in the following sections. The scores can be found in the supplementary file scores.csv.

Computation of the Valdar Score
The Valdar score (Valdar 2002) attempts to account for transition probabilities, stereochemical properties, amino acid frequency gaps, and, particularly essential for this study, sequence weighting. Due to the heterogeneity of sequence data set availability, most Oskar sequences occupy only a small portion of insect diversity, primarily Hymenoptera and Diptera. Sequence weighting allows for the normalization of the influence of each sequence on the score based on how many similar sequences are present in the alignment (Valdar 2002). We implemented the algorithm described in Valdar (2002) in a python script (see besse_blondel_conservation_scores.py), then calculated the conservation scores for the Oskar alignment we generated above.

Computation of the Jensen-Shannon Divergence Score
Jensen-Shannon Divergence (JSD) (Lin 1991;Capra and Singh 2007) uses the amino acid and stereochemical properties to infer the "amount" of evolutionary pressure an amino acid position may be subject to. This score uses an information theory approach by measuring how much information (in bits) any position in the alignment brings to the overall alignment (Capra and Singh 2007). This score also takes into account neighboring amino acids in calculating the importance of each amino acid. We used the previously published python code to calculate the JSD of our previously Blondel et al. . doi:10.1093/molbev/msab284 MBE generated Oskar alignment (Capra and Singh 2007) (see score_conservation.py).

Computation of the Conservation Bias
The measure of differences in conservation between the holometabolous and hemimetabolous Oskar sequences presented in the results was done as follows: We first split the alignment into two groups containing the sequences from each clade (see 2.4_Oskar_pgc_specification.ipynb). Due to the high heterogeneity in taxon sampling between hemimetabolous and holometabolous insects, we ran a bootstrapped approximation of the conservation scores on holometabolous sequences. We randomly selected N sequences (N ¼ the number of hemimetabolous sequences), computed the Valdar conservation score (see Computation of the Valdar Score), and stored it. After 1,000 iterations, we computed the mean conservation score for each position for holometabolous sequences. For hemimetabolous sequences, we directly calculated the Valdar score using the method as described above (see Computation of the Valdar Score). For each position, we then computed what we refer to as the "conservation bias" between Holometabola and Hemimetabola by taking the ratio of the log of the conservation score Holometabola and Hemimetabola. Conservation Bias ¼ LogðValdar holo Þ LogðValdar hemi Þ for each position (see 3.4_LogRatio_Bootstrap.ipynb).

Computation of the Electrostatic Conservation Score
To study the conservation of electrostatic properties of the Oskar protein we computed our own implementation of an electrostatic conservation score (see besse_blondel_conserva-tion_scores.py). Aspartic acid and Glutamic acid were given a score of À1, Arginine and Lysine a score of 1, and Histidine a score of 0.5. All other amino acids were given a score of 0. Then, we summed the electrostatic score for each sequence at each position and divided this raw score by the total number of sequences in the alignment. This computation assigns a score between À1 and 1 at each position, À1 being a negative charge conserved across all sequences, and 1 a positive charge.

Computation of the Hydrophobic Conservation Score
To study the conservation of hydrophobic properties of the Oskar protein we implemented our own hydrophobic conservation score (see besse_blondel_conservation_scores.py). At each position, each amino acid was given a hydrophobic score taken from a previously published scoring table (Moon and Fleming 2011). (This table is implemented in the besse_-blondel_conservation_score.py file for simplicity.) Scores at each position were then averaged across all sequences. This metric allowed us to measure the hydrophobicity conservation of each position in the alignment and is bounded between 5.39 and À2.20.
Computation of the RNA Binding Affinity Score RNA binding sites are defined as areas with positively charged residues and hydrophobic residues. To estimate the conservation of RNA binding sites in oskar homologs, we used RNABindR v2.0 (Terribilini et al. 2007), an algorithm predicting putative RNA binding sites based on sequence information only. We automated the calculation for each sequence by writing a python script that submitted a request to the RNABindR web service (see RNABindR_run_predictions.py). We then aggregated all results into a scoring matrix, and averaged the score obtained for each position. We call this score the RNABindR score and hypothesize that it reflects the conservation of RNA binding properties of the protein. Importantly, this score was obtained in 2017 for only a subset of 219 proteins used in this study (indicated in the supplementary files at: 03_Oskar_scores_generation/RNABindR_raw_sources). Since then, the RNABindR server has been defunct and we could not repeat those measurements as the source code for this software is unavailable.

Computation of Secondary Structure Conservation
Due to the overall low conservation of the LOTUS domain, we decided to see whether the secondary structure was conserved. To this end, we used the secondary structure prediction algorithm JPred 4 (Drozdetskiy et al. 2015). Given an amino acid sequence, this tool returns a positional prediction for a-helix, b-sheet or unstructured. We used the JPred4 web servers to compute the predictions and processed them into a secondary structure alignment (see 2.6_Oskar_lotus_osk_structures.ipynb). We then used WebLogo (Crooks et al. 2004) to visualize the conservation of the secondary structure.

Visualization of Conservation Scores
We used PyMOL (DeLano 2002) to map the computed conservation scores onto the solved structures of LOTUS and OSK (Jeske et al. 2015(Jeske et al. , 2017. At the time of writing, no fulllength Oskar protein structure had been reported. With the caveat that all visualization was done on the structure of the D. melanogaster protein domains, we created a custom python script that augments PyMOL with automatic display and coloring capacities. This script is available as Oskar_pymol_visualization.py, and contains a manual at the beginning of the file. For the OSK domain, we used the structure PDBID: 5A4A, and for the LOTUS domain, PDBID: 5NT7 (Jeske et al. 2015(Jeske et al. , 2017. The LOTUS structure we used is in complex with Vasa, and in a dimeric form (Jeske et al. 2017), allowing for easy interpretation of the different conservation scores. For the OSK structure, we removed the residues 399À401 and 604À606 from the PDB file as those amino acids did not align across all sequences and therefore showed highly biased conservation scores.

Statistical Analysis
All statistical analyses were performed using the scipy stats module (https://www.scipy.org/). Significance thresholds for P values were set at 0.05. Statistical tests and P values are reported in the figure legends. All statistical tests can be found in the ipython notebooks mentioned below.

Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.