Hypersaline Lake Urmia: a potential hotspot for microbial genomic variation

Lake Urmia located in Iran is a hypersaline environment with a salinity of about 27% (w/v). Metagenomic analyses of water samples collected from six locations in the lake exhibited a microbial community dominated by representatives of the family Haloferacaceae (69.8%), mainly those affiliated to only two genera, Haloquadratum (59.3%) and Halonotius (9.1%). Similar to other hypersaline lakes, the bacterial community was dominated by Salinibacter ruber (23.3%). Genomic variation analysis by inspecting single nucleotide variations (SNVs) and insertions/deletions (INDELs) exhibited a high level of SNVs and insertions, most likely through transformation for abundant taxa in the Lake Urmia community. We suggest that the extreme conditions of Lake Urmia and specifically its high ionic concentrations could potentially increase the SNVs and insertions, which can consequently hamper the assembly and genome reconstruction from metagenomic reads of Lake Urmia.

Hypersaline environments, widely distributed across a variety of climate zones, are characterized by high salt concentration and are often aquatic systems (thalassohaline, of marine origin or athalassohaline, formed by dissolution of mineral salt deposits of continental origin) or saline soils 1,2 . Studies on different hypersaline environments show that Haloquadratum and certain Balneolaeota members may preferably grow in aquatic or soil habitats, respectively, while haloarchaea, nanohaloarchaea, and Salinibacter are capable of adapting to both environments 3 . Microbial cells require specific adaptations that enable them to thrive in the extreme conditions of different hypersaline environments 4 . In addition to their intrinsic capabilities, microorganisms may undergo variations to adapt to extreme conditions. Generally, mutation and lateral gene transfer in microbial communities are instrumental in developing adaptive features or fitness-conferring variants 5 . The balance of variation and selection causes the unadapted cells to be outcompeted by thriving microorganisms that are getting selected, which could consequently lead to reduced phylogenetic diversity of the community and potentially an increased rate of variation at the strain level 4 .
Metagenomic studies, focused on the diversity and metabolic capabilities of reconstructed metagenomeassembled genome (MAGs) from environmental samples, have expanded our knowledge about microbial diversity of natural ecosystems, including hypersaline environments 6,7 . However, these studies overlook the withinspecies diversity of microbial inhabitants of these environments. In ecosystems with high microbial diversity at strain level (microdiversity), where the performance of MAG reconstruction is limited, we need to adopt elaborate approaches to study within-species gene content variations, single nucleotide variations (SNVs), and INDEL profiles 8 .
Our knowledge of prokaryotic microdiversity is largely based on comparative genomic analyses of isolates from different locations. Originating from completely different habitats, it is expected for these genomes to vary in fitness backgrounds and genomic adaptations specific to the local conditions. To evaluate the gene content of strains and variations in these genes, pan-genome studies can present valuable information. Pan-genome contains both the core that is present in all organisms belonging to the same species as well as accessory genes, which are not shared in all representatives but could potentially confer important additional capabilities 9 .
In this study, we explore the microdiversity and pan-genome of highly abundant microorganisms in the hypersaline Lake Urmia, an endorheic salt lake located in Northwest Iran. At its greatest extent, Lake Urmia was Overall microbial diversity of Lake Urmia. The distribution of peaks in a length-weighted G+C histogram showed that the G+C content of the raw metagenomic reads and assembled contigs in reconstructed LUMs are consistent. As shown in Fig. 2, discrete peaks at 47% and 67% G+C content of raw metagenomic reads correspond to the G+C content of the reconstructed LUMs (Supplementary Table S2), including Haloquadratum at 47% G+C, and several archaeal and bacterial populations with G+C content at around 67%.
Genome-resolved analysis of the Lake Urmia microbial diversity. The presence of closely related strains of the same species in the ecosystem causes high genetic variation leading to insufficient coverage of each strain that might hamper assembly and consequently MAG reconstruction for such groups 17 . In this study, 204 million sequenced high-quality (Phred quality score > 30) paired-end reads (150 bp), provided acceptable depth but the different abundance of various species in a metagenomic sample resulted in a highly non-uniform read coverage across different genomes. We detected an outstanding level of variations (arising from SNVs and insertions) in the dominant microorganisms, including representatives of the haloarchaeal genera Haloquadratum and Halonotius, and the extremely halophilic bacterium Salinibacter ruber, which is causing a decrease in the completeness of their MAGs reconstructed from the metagenome of Lake Urmia. Binning assembled contigs Figure 1. Location of Lake Urmia in Iran and of the six sampling sites. The map was prepared in R studio, using the ggmap 3.0.1 package (left). View of hypersaline Lake Urmia on October 20th, 2020 (right). Genetic variation of archaeal and bacterial LUMs. We identified 812,009 SNVs of all possible point variations (transition and transversion) in the LUMs and reference genomes with > 1% relative abundance in the metagenome. These variations were prevalent, especially in Haloquadratum, Salinibacter, and Halonotius representatives (the number of SNVs is shown in Fig. 3 and a full description is provided in Supplementary Table S7). In the case of INDELs, the frequency of deletions was low, while insertions were prominent. The length of the insertions ranged up to 354 nucleotides and was frequent in all reads mapped to the reference genomes, especially in Haloquadratum, Salinibacter, and Halonotius representatives (Supplementary Table S8).
DNA sequence variation can be introduced from internal sources (within the cell) during metabolic processes, generating oxidation, hydrolysis, and alkylation damage, along with the incorporation of mismatched bases 18 . Further, inducing agents such as UV and ionizing radiation as well as various chemical mutators may cause the incorporation of base analogs and base lesions, including the deamination, depurination, and methylation of bases, oxidative damage, and DNA double-strand breaks (DSB) 19 . If the repairing mechanisms work well, these are excluded, but if unsuccessful, an SNV is perpetuated; for example, among DSB repair pathways, there are nonhomologous end joining and microhomology-mediated end joining, which respond quickly with a high error rate, causing INDEL and translocations 18 . We hypothesize that high ionic levels of the hypersaline environment may contribute to the high SNVs. For prokaryotes with the "salt in" strategy, high levels of K + , Na + , and Mg 2+ (and other ions) may interfere with the accuracy of archaeal enzymes and the associated chaperones leading to SNVs, however, further empirical proofs are required to confirm this hypothesis.
While the compaction and stabilization of DNA in eukaryotic cells are provided by the function of histone proteins (neutralizing 57% of the DNA negative charge) 20 , the remaining negative charge of DNA is compensated for by cations. The stability of dsDNA increases at higher salinities 21 . As reported by Borin et al. 22 , naked DNA in deep-sea anoxic hypersaline brines, independent of the species of origin, was capable of participating in natural transformation after weeks of exposure. In addition to their role in stabilizing the structure of DNA molecules, cations can increase the melting temperature of DNA molecules. GC-rich regions of the DNA also have a higher melting point. Altogether, it is quite expected that intact, active, and double-stranded standard DNA (which has higher stability compared to single-stranded DNA) from disrupted cells is present in Lake Urmia. A combination www.nature.com/scientificreports/ of cations as mediators, together with the lower temperature of the lake (the average annual temperature is 9 °C) facilitates the suitable folding of DNA into a compact structure. Additionally, it mediates the attachment of the DNA to the negative charges of the cell membrane's phosphatidylcholine and phosphatidylserine, DNA-uptake apparatus, anchoring and stabilizing the interaction of DNA with the membrane 23 . Furthermore, the calcium ions bound to the cell membrane also cause changes in the membrane permeability and facilitates DNA entry into the cell 22 . In addition to standard B-DNA, research has shown that the presence of GC-rich DNA molecules in saline solutions can cause non-standard structures of G-quadruplexes and G-triplex, which have much higher resistance than the standard DNA; however, their role in transformation is not well studied 24 .
As the most dominant prokaryotes of the lake and based on the extent of coverage, breadth, SNV, and INDEL, we focused on the dominant taxa in the community using reference genomes including Haloquadratum walsbyi DSM 16790, Salinibacter ruber ST67, their corresponding LUMs, and Halonotius pteroides CECT 7525.
Genus Haloquadratum. A major percentage of the Lake Urmia metagenomic reads was mapped to Haloquadratum walsbyi DSM 16790 (14.53%), with a minor contribution of Haloquadratum walsbyi J07HQW2 (0.14%), Haloquadratum walsbyi C23 (0.08%) and Haloquadratum sp. J07HQX50 (0.05%). The square-shaped halophilic Haloquadratum walsbyi, with a G+C content of 47.9%, is a member of the family Halobacteriaceae and is the most prevalent archaeon found in saline environments as documented in previous studies 25 . This archaeon can grow optimally at 23-30% (w/v) total salt concentrations, is tolerant to very high magnesium levels, and as a photoheterotroph, it can obtain energy from light absorbed by bacteriorhodopsin 26 . Possessing two 3.1 Mb chromosomes, many deletions, short direct repeats (4-20 bp), and insertions have been reported by Dyall-Smith et al. 18 , conferring that the uptake and integration of foreign DNA have contributed to the evolution of Haloquadratum walsbyi genome. Consistent with GATK 4.2, which yielded a high density of genomic variation, a strain-level study of Haloquadratum walsbyi using inStrain 1.5.7, exhibits a high level of SNVs. Due to the high abundance of this archaeon, the high number of SNVs and reception of external DNA from not only intra-species but also inter-species sources, such as Haloarcula sp. CBA1115, Haloplanus rubicundus, or the halophilic archaeon DL31, the genetic diversity of this microorganism in Lake Urmia is very high. Based on peaks in the SNV distribution at about 63%, and 83%, it seems that there are two other strains present in the sample which can be justified due to the existence of LUM.47 and LUM.48 (Fig. 4a). Our results showed that most of the detected insertion sequences originated from Haloquadratum walsbyi C23. Moreover, a plasmid from Haloplanus rubicundus, a 63 bp insertion from the chromosome of the halophilic archaeon DL31, and two identical insertions of 223 bp from Haloarcula sp. CBA1115, and a 49 bp insertion from the chromosome of Salinarchaeum sp. IM2453 were also spotted in the metagenomic reads that mapped to the Haloquadratum walsbyi genomes.
LUM.22, LUM.47, and LUM.48 retrieved 11.69%, 0.25%, and 0.36% of the mapped reads, and are affiliated to Haloquadratum walsbyi, Haloquadratum sp., and Haloquadratum walsbyi_A, respectively. The completeness of these LUMs is respectively, 59.11%, 85.61%, and 73.77%. Numerous detected SNVs and INDELS have yielded a pool of genetic variation for Haloquadratum walsbyi in Lake Urmia leading to disruption of the assembly process and consequently low completeness of representative MAGs of highly abundant taxa in the lake.
The overall variation of LUM.22 is lower than that of Haloquadratum walsbyi DSM 16790 (Fig. 4b) since during the assembly, reads with SNV or insertion may not overlap to produce a contig and therefore will not be associated in the final LUM. Our results showed some insertions from Haloquadratum walsbyi and a 36 bp insertion from Halorubrum sp. PV6 plasmid in LUM. 22. Genus Halonotius. Representatives of the next most abundant genus, include Halonotius pteroides CECT 7525 (6.8%), Halonotius sp. J07HN4 (1.2%), and Halonotius sp. J07HN6 (3.96%). Halonotius has a worldwide environmental distribution and has been reported in several habitats, such as solar salterns in Turkey 27 and Spain 28 , saline lakes in Australia 29 , and China 30 , and food-grade salt samples 31 .
Genus Salinibacter. Salinibacter ruber ST67 (1.19%), Salinibacter ruber M1 (0.88%), Salinibacter ruber SP2521 (0.68%), and Salinibacter ruber SP273 (0.67%), are reference representatives of the next most prevalent genus determined on metagenome of the Lake Urmia. Because of its broad distribution and high intraspecific genomic and functional diversity at both transcriptomic and metabolomic levels, this bacterium can be considered one of the main models for ecological and evolutionary studies of bacterial adaptation to hypersaline environments 32 .
Variant calling by GATK 4.2, showed various types of SNVs (Fig. 4d) with a very high frequency of insertions. A remarkable point of Salinibacter ruber variation was the intra-species insertions originating from the Haloquadratum walsbyi and the high frequency of intra-species plasmid integration. Our results are in agreement with those of González-Torres and Gabaldón 32 , who reported a highly variable accessory genome in Salinibacter ruber and highlighted the impacts of horizontal gene transfer (HGT) and homologous recombination (HR) processes 32  www.nature.com/scientificreports/ LUM.80 is taxonomically affiliated to Salinibacter ruber ST67 with ANI (Average nucleotide identity) of 98.60% (and to Salinibacter ruber SP2521 with ANI of 98.52%), with a G+C content of 67.26%, completeness of 75.87%, and relative abundance of 1.25%, which its overall variation is lower than that of Salinibacter ruber ST67 (Fig. 4e).
Other genomes detected in the metagenome have a relative abundance of lower than 1% and were not the focus of our analysis to explore their genomic variations. DNA exchange mechanisms in Lake Urmia prokaryotes. Three major mechanisms of DNA exchange in prokaryotes include natural transformation, transduction, and conjugation 33 . To assess the role of transduction in DNA exchange in Lake Urmia we used VirSorter2 34 . VirSorter2 34 provided viral sequences that made up a small proportion (0.063%) of total sequences within the Lake Urmia metagenome (Supplementary Table S9). In total, 40 viral sequences (including 24 double-stranded DNA phage and 16 single-stranded DNA viruse affiliated ccontigs) were identified, among which 21 sequences could be taxonomically classified by using the genome detective virus tool (v 1.133). A total of 15 viral sequences were similar to previously reported haloviruses and accounted for 0.013% of the total metagenome. In addition, two Natrialba phage PhiCh1, two Archaeal BJ1 virus, one Haloarcula hispanica pleomorphic phage, and one Halorubrum phage HF2, accounting for 0.002% of the metagenome were identified. Nineteen viral sequences (accounting for 0.048% of the metagenome) were not assigned to other previously reported viruses. To evaluate the host-virus relationships, tRNAscan-SE 2.0, PHASTER (Enhanced Release), and minCED (0.4.2) tools detected no tRNA and CRISPR sequences within the LUMs presenting no evidence for host prediction and limited our ability to further explore the role of transduction.
To assess the prevalence of conjugation, insertion sequences were evaluated for their affiliation to plasmid sequences. Although some plasmid sequences were identified, the very short size of these sequences suggests a limited possibility of whole plasmid conjugation.
For transformation, as described earlier, stable DNA fragments present in Lake Urmia could potentially pass through the cell envelope, composing of the S layer and the cell wall. A high concentration of cations can react with the negative charges of the cell envelope, increasing its stability and reducing permeability to prevent DNA entry 35 . However, during chromosome duplication and cytoplasm separation, the hydrolytic enzymes such as pseudomurein endoisopeptidases PeiW and PeiP determined in Methanothermobacter strains and archaeosortase detected in all archaea (KO K24447, also detected in Lake Urmia metagenome) degrade the cell envelope, causing cells to have an absent or defected cell envelop. This could be posed as an opportunity for DNA to bypass the cell envelope and transfect the cell 36 . When entering the cell, DNA can be a substrate for restriction enzymes. However, specific restriction enzymes of the cells can distinguish exogenous DNA by considering specific DNA methylation patterns of the host. Host-mimicking DNA (possessing the same DNA methylation pattern) can cause the exogenous DNA not to be recognized by the restriction enzymes. The nucleotide similarity of the exogenous DNA can be a signal to recruit the ubiquitous RecA instead of the nucleases. The incoming exogenous DNA can be replaced in the host genome by homologous recombination 37 . Because DNA present in the lake originates from strains of a few genera with similar methylation pattern, in case of insertion into the living cells, they may be used as beneficial sources to repair the damaged DNA sequences of the host cell via recombination 37 . This hypothesis is consistent with our results. As shown in Supplementary Table S8, a highly significant number of the insertions in Haloquadratum walsbyi, Halovenus, and Salinibacter ruber originated from various strains of the species (most likely with similar methylation patterns) 38 . The other immunity mechanism is the CRISPR-Cas in which the base pair complementarity of the crRNA with the exogenous DNA triggers nucleases to degrade foreign DNA 38 . However, if there is even a point mutation (which is abundant in the genomes of the Lake Urmia) or base substitution (like cytosine to 5-hydroxymethylcytosine) the complementarity structure is not formed resulting in nullifying the CRISPR-Cas 39 . We presented some bypassing mechanisms of the host cell immune response but regarding the length of the insertions (up to 354 nucleotides), it seems that the exogenous DNA has been unsuccessful in evading the immunity system of the cell. To explain this, it can be hypothesized that the exogenous DNA is either degraded in the environment to smaller sequences that can pass the membrane to be integrated, or the large DNA sequences can enter the cell but are degraded to small sequences before integration. Given that small DNA sequences can pass the membrane more efficiently, the former hypothesis may better illustrate the transformation mechanism in Lake Urmia.
Functional assessment of the core and accessory genes of Haloquadratum walsbyi. The sequences representing the core and accessory clusters of Haloquadratum walsbyi were 576 and 178 KOs (organized in Orthologs, modules, and networks), respectively. Annotated genes of the core components could be organized in 11 KEGG modules covering all essential proteins in carbohydrate, energy, nucleotide, amino acid, cofactors, vitamins, terpenoids, and polyketides metabolism (Supplementary Table S12). The genes in the auxiliary components did not form any complete module. Our result showed that the genes involved in the prokaryotic defense system (restriction and modification system and toxin-antitoxin system) and RNA polymerase are exclusively present in the core signifying their crucial roles. Restriction and modification systems include type I restriction enzyme, DNA methyltransferases, adenine-specific DNA-methyltransferase, and modification methylase. Detailed KEGG analysis of the core and auxiliary components is presented in Supplementary Tables S10-S12. Further, various chaperonins including heat shock proteins HSP60, heat shock proteins GimC, folding catalyst dnaJ (which play a role in stress response), and the proteasome responsible for folding, sorting, and degradation of the mistranslated, misfolded, and damaged proteins 40 are mostly organized in the core.

Conclusion
To understand the community structure and microdiversity of Lake Urmia (as an extreme habitat with a high concentration of ions), we reconstructed MAGs and analyzed various modes of genetic variation in them.
In terms of horizontal gene transfer processes, transformation was detected to be the main strategy of DNA insertion. Our analysis showed that bacteria and archaea exhibit high inter/intra-genera gene exchange and metagenomic analysis revealed a high abundance of Haloquadratum walsbyi, Halonotius, and Salinibacter ruber representing high SNVs leading to challenges in reconstructing complete MAGs from their representatives in the Lake Urmia metagenomes due to the very high level of microdiversity. Finally, we suggest that the high ionic concentrations of such hypersaline ecosystems might play a role in the microdiversity profile of its highly abundant taxa where more empirical analyses will be required to fully clarify its role.

Methods
Geographic description of the study site, physico-chemical analysis, and sampling procedures. The lake is located between the provinces of East Azerbaijan and West Azerbaijan in Iran, and West of the southern portion of the Caspian Sea. In coordination with the Iranian Department of Environment, samples were collected on October 20th, 2020. Approximately 20 L of water was collected in sterile containers at 20 cm depth from six different locations along the vertical transect of Lake Urmia as is shown in Fig. 1. Due to the very low water levels in the southern part of the lake, no sample was collected from this part. Samples were kept cold until further analysis. For physico-chemical analysis of the samples, standard methods based on American Public Health Association (APHA) were used 41 .
Environmental DNA extraction and quality control. To collect biomass for metagenomic sequencing, we removed larger particles and eukaryotes such as Dunaliella salina, by pre-filtering samples through 3-μm filters (cellulose-nitrate, Millipore). For environmental DNA extraction, we mixed the water samples (120 L) and used 40 L for biomass collection using two strategies. Ten liters of the sample were centrifuged at 4500 rpm (3260×g) for 60 min, and retentate was used for DNA extraction using QIAprep® Miniprep (Qiagen). The second strategy was biomass collection on polycarbonate membrane filters with 0.22-μm pore size (Isopore Membrane Filter, Isopore™ Millipore), and finally DNA extraction from filters using Qiagen DNeasy (Qiagen). The quantity and quality of the extracted DNA were analyzed by a NanoDrop™ One C Microvolume UV-Vis Spectrophotometer and agarose gel electrophoresis.
Sequencing and assembly. The purified environmental DNA was sequenced using Illumina NovaSeq 6000 platform at Novogene Co. Ltd (China) as a paired-end (PE150) library. Metagenomic raw reads were quality-checked using FastQC 0.11.9 42 and trimmed using Trimmomatic 43 . The paired-end sequences were merged using BBtools, reformat.sh script (sourceforge.net/projects/bbmap/). For preliminary taxonomic profiling, the raw reads of the metagenomic sample were analyzed using MetaPhlAn3 15 with a database of 1.1 M markers using bowtie2 44 . The trimmed read sets were assembled using MEGAHIT (v1.0.3) 45 with paired-end mode, k min = 49, k max = 149, k step = 10. To evaluate assembly quality, QUAST 46 was applied. For metagenome diversity, rarefaction analysis based on phylogenetic reads was performed using MG-RAST server 4.0.3 47 .
Detection and phylogenetic analysis of ribosomal RNA in the metagenome assemblies. To retrieve ribosomal RNA genes from the metagenomic assemblies, barrnap v0.9 16 was used which predicts the location and sequence of ribosomal RNA genes in genomes and supports both Bacteria (5S, 23S, and 16S rRNA) and Archaea (5S, 5.8S, 23S, and 16S rRNA). The 16S rRNA sequences were checked against three databases including NCBI, EzBioCloud, and the genome taxonomy database (GTDB) 16S rRNA sequences (release 202) using NCBI-BLAST+PACKAGE. For phylogram construction, Ngphylogeny.fr 48 followed by iTOL v6 49 were used.
Reconstructing Lake Urmia MAGs, quality check, and taxonomy assignment. Binning of assembled contigs for reconstructing metagenome-assembled genomes (MAGs) was carried out for contigs ≥ 2.5 kb using MetaBAT2 50 . Mapping metagenomic reads to the assembly were performed using Bowtie2 44 , with the setting (-local-sensitive). The completeness and contamination of the reconstructed Lake Urmia MAGs (abbreviated as LUM) were evaluated using CheckM (v1.1.3) 51 . LUMs with CheckM-completeness ≥ 40% and contamination < 5% were selected for further analysis. For contamination estimation by 16S rRNA genes and other statistics, ContEst16S 52 was used. Taxonomic assignment of the LUMs was performed using GTDB-tk v1.3 (Pierre-Alain Chaumeil, 2020) and average nucleotide identities were calculated using FastANI v1.33 53 . Additionally, 16S rRNA gene sequences were extracted from representative LUMs using Barrnap v0.9 and analyzed using NCBI-BLAST+PACKAGE. To map the community structure of the metagenome, Bowtie 2, using -localsensitive setting was used to a reference genome file containing all 22 LUMs, and publicly available genomes including 132 bacterial, 152 archaeal, and 10 viral genomes (in total 294 genomes). Results were processed using SAMtools-1.14 followed by CoverM (0.6.1) to analyze the community structure of the Lake Urmia metagenome.

Identification of viral contigs.
To retrieve viral contigs associated with the metagenome and the LUMs, we used VirSorter2 34 . Viral sequences with a max score of 1.00 underwent the following analyses. For the taxonomic assignment, the retrieved sequences were analyzed by genome detective virus tool version 1.133 (which is based on the viral Refseq protein database from NCBI). tRNAs were analyzed with tRNAscan-SE 2.0 54 , for