Phosphonate production by marine microbes: Exploring new sources and potential function

Significance Phosphonates are a class of phosphorus metabolites characterized by a highly stable C-P bond. Phosphonates accumulate to high concentrations in seawater, fuel a large fraction of marine methane production, and serve as a source of phosphorus to microbes inhabiting nutrient-limited regions of the oligotrophic ocean. Here, we show that 15% of all bacterioplankton in the surface ocean have genes phosphonate synthesis and that most belong to the abundant groups Prochlorococcus and SAR11. Genomic and chemical evidence suggests that phosphonates are incorporated into cell-surface phosphonoglycoproteins that may act to mitigate cell mortality by grazing and viral lysis. These results underscore the large global biogeochemical impact of relatively rare but highly expressed traits in numerically abundant groups of marine bacteria.


12).
Phosphonate producers had similar abundances at both HOT and BATS and in other samples from the global ocean. For SAR11 and total bacterioplankton these abundances generally increased with depth and were highest below the subsurface deep chlorophyll maximum layer (> 200 meters). Prochlorococcus phosphonate producers were at highest abundance in the surface layer (approximately 10 meters) at HOT and BATS and decreased in abundance with depth. In the surface layer at HOT, approximately 6-9% of Prochlorococcus genomes, 10% of SAR11 genomes, and 10% of total bacterioplankton genomes had a phosphonate production gene cluster. At BATS, a significant seasonal effect on the abundance of all phosphonate producers in the surface layer that temporally corresponds with the recurring deepening of the mixed layer during winter months was observed (Fig. S8). For example, Prochlorococcus phosphonate producers are most abundant from February to May then decline to negligible levels in the summer at the onset of peak water column restratification. The seasonality of SAR11 producers and the total community producers was tightly coupled, suggesting that peak producer abundance occurs during periods with the deepest mixed layer (November through the following March). This pattern follows the seasonal succession of ecotypes at BATS where SAR11 clades IIa and IIb are most abundant in the upper 200 meters during the winter (13). SAR11 clades IIa and IIb were also the most abundant producers in GORG-tropics and GORG-BATS after the surface clade Ia, and SAR11 IIb abundance showed a strong positive correlation with phosphonate producers in the global dataset.

Biotic and abiotic covariates shaping the global distribution of phosphonate producers
For Prochlorococcus producers we find that a small, but significant, amount of variation is explained by phosphate concentrations, and the modeled climatological mean of dissolved organic phosphorus (Fig.  S6, Fig. S9). The models suggest that on average Prochlorococcus phosphonate producers are most abundant where phosphate and DOP concentrations are highest. For example, the North Atlantic and Mediterranean had the lowest median abundance of Prochlorococcus producers. In the deeply sequenced North Atlantic sample GORG-BATS we detected no Prochlorococcus phosphonate producer genomes and significantly more Prochlorococcus phosphonate consumer genomes than in the rest of the global samples. This pattern mirrors the overall depletion of phosphate relative to other nutrients in the North Atlantic (8). We interpret this as reflecting potential selection against the phosphonate biosynthesis trait in Prochlorococcus populations from consistently P-limited regions, such as the tropical North Atlantic, due to the additional P cost of phosphonate biosynthesis. The largest proportion of variation for Prochlorococcus phosphonate producers is explained by the total abundance of Prochlorococcus. That is, Prochlorococcus phosphonate producers are more abundant when Prochlorococcus constitutes a smaller fraction of the total microbial community. It is unclear what ecological mechanisms might underlie this pattern.
The relative abundances of SAR11 phosphonate producers and total bacterioplankton producers were strongly positively correlated, consistent with SAR11 being the dominant phosphonate producer in GORG-tropics and GORG-BATS. Thus, the relative abundances of SAR11 producers and total producers were shaped by many of the same factors. A small but significant amount of variation in SAR11 phosphonate producers was explained by increasing climatological mean nitrate concentrations and dissolved aluminum concentrations. This reflects that the highest average abundance of SAR11 producers occurred in the North Atlantic and Mediterranean Sea, regions which have higher than average dissolved aluminum concentrations (14) and climatological mean nitrate concentrations in the upper 100 meters. Generally, we find that the greatest amount of phosphonate producer variation is explained by multiple depth-dependent biotic factors implying that increasing depth is a unifying master parameter. Total bacterial/archaeal and SAR11 phosphonate producers increase alongside depth-associated biotic factors including low light Prochlorococcus and the abundance of mesopelagic groups like Archaea (15) and SAR11 IIb (16). Phosphonate producers are negatively correlated with known surface clades like SARII IV and V and abiotic factors like dissolved manganese concentrations (manganese being a surface enriched element (17)). These modeling results offer a prediction that bulk phosphonate concentration is greatest in the lower reaches of the euphotic zone and upper reaches of the mesopelagic. Future chemical studies targeting the 100-300 meters depth range will hopefully provide greater understanding of phosphonate cycling in the upper ocean.
In the surface oligotrophic ocean -where Prochlorococcus thrives -phosphonates constitute about 20% of the HMWDOP pool (18) which itself is ~ 25-30% of the DOP pool (based on carbon measurements) (19). Assuming a mean concentration of 200 nM for DOP in the surface of oligotrophic areas (20) then HMWDOP and phosphonate concentrations are 50 nM and 10 nM respectively. With a phosphonate residence time estimated to be between 1-2 years (we used a mean value of 1.5 years) (21,22) this corresponds to a global phosphonate production rate of 6.7 nM y -1 .
MGI Thaumarchaeota generally occur at low abundance in the euphotic zone (23) but rapidly increase in abundance in the upper mesopelagic where they can account for nearly 40% of all DNA staining bacterioplankton (22)(23)(24). Since our interest is the euphotic of the oligotrophic gyres we assume a mean MGI Thaumarchaeota abundance in this depth range (upper 200 meters) of 5×10 4 cells L -1 (24)(25)(26) (28,29) and that 25% of cellular P is in the form of phosphonates (29), global annual phosphonate production for MGI can then be calculated as 0.009 nM y -1 which would account for 0.15% of the total phosphonate production in the euphotic zone of the gyres. For additional context, the entire cellular P production rate of MGI in the euphotic zone would account for approximately 0.65% of the annual phosphonate budget. Thus, MGI alone cannot account for the observed phosphonate concentrations in the euphotic zone of the oligotrophic gyres. However, MGI are likely important phosphonate producers in the mesopelagic zone where their abundance becomes substantially higher.
We also performed a similar calculation for Trichodesmium following the calculations from Dyhrman (30). The raft morphology of Trichodesmium can occur at densities of 60 colonies m -3 in non-bloom conditions (31). Extrapolating from phosphonate measurements from Trichodesmium in culture and in the field (30) this results in a standing stocking stock of 0.013 mol L -1 of phosphonate in Trichodesmium. This is a very coarse estimate because there is high variability in P content per colony and colony abundance across the global tropical and subtropical gyres. Furthermore, these numbers are parameterized from the western North Atlantic Ocean which may not be representative of other regions. Assuming a specific growth rate of 0.25 d -1 , which is the maximum observed in culture (33), this would result in an annual production of 1.19 nM y -1 or ~17% of the global average phosphonate production rate. Likewise, assuming that Trichodesmium fixes 60 Tg new N y -1 across the ocean (34), that fixed N is its sole N source, and that the N:P ratio of Trichodesmium is 30 (35) would result in 1.43 ×10 11 mol P y -1 production. If 10% of cellular P exists in the form of phosphonate then this would account for about 19% of the observed phosphonate budget in the gyres.
Prochlorococcus is responsible for fixing 4 GT C y -1 which represents 0.33 x 10 15 mol C y -1 (35). We identified about 6% of Prochlorococcus as putative phosphonate producers. Since we found no relationship between clade and pepM we assume that this group behaves like Prochlorococcus SB with regard to phosphonate and like an average Prochlorococcus with regards to all the other metabolisms. Therefore, the phosphonate producer strains are responsible for the fixation of 0.02 x 10 15 mol C y -1 . Assuming all Prochlorococcus phosphonate producers have a similar C/P ratio than Prochlorococcus SB i.e. C/P = 156/1 (36) then they fix 12.8 x 10 10 mol P y -1 . With 40% of the P allocated to phosphonates (our study) this yields a production of 5.1 x 10 10 mol Phn y -1 . As Prochlorococcus inhabits the top 200 m of oligotrophic gyres, Prochlorococcus could produce 4.6 nM Phn y -1 sustaining ~69% of the phosphonate yearly production. We can also estimate the contribution to the phosphonate production from field measurements. At a mean concentration of Prochlorococcus of 4.4×10 7 cells L -1 in the euphotic zone between 30N and 30S (37), assuming that 6% of Prochlorococcus cells are phosphonate producers with a cellular P quota of 4.2×10 -18 mol cell -1 (observed in natural populations from equatorial North Atlantic with C:P ratio = 200-800) (38), and a specific growth rate of 0.4 d -1 (39) we can calculate that Prochlorococcus accounts for 10% of the phosphonate budget in the euphotic zone of the oligotrophic gyres. Thus, it is plausible that Prochlorococcus could account for between 10% to 70% of the global euphotic zone phosphonate production.
A similar calculation can be done for SAR11. SAR11 has an estimated number of cells in the global ocean of 5×10 8 cells L -1 (40) or ~5.5 x 10 27 cells in the top 200 m of the oligotrophic ocean where SAR11 is most abundant (41). Based on our study, 18% or 9.9 x 10 26 cells of SAR11 that are putative phosphonate producers. The P quota of SAR11 growing exponentially in culture, is 0.16×10 -16 mol P cell -1 (42). With a specific growth rate of 0.1 d -1 , SAR11 phosphonate producers are responsible for the yearly production of 57.8×10 10 mol P y -1 . We do not have an experimental estimate of the amount of P allocated to phosphonate production in SAR11. However, even with a phosphonate allocation of only 5% SAR11 could produce 2.9 x 10 10 mol Phn y -1 which corresponds to a production of 2.6 nM Phn y -1 and could sustain ~40% of the yearly phosphonate production. With a phosphonate cellular abundance of 10%, comparable to Trichodesmium erythraeum, SAR11 could sustain ~80% of the yearly phosphonate production. Using data from cells collected from the tropical North Atlantic Ocean where SAR11 has a mean cellular P quota of 0.065×10 -16 mol P cell -1 (43), a specific growth rate of 0.1 d -1 (44) and a standing stock of 5×10 8 cells L -1 (44) we estimate that SAR11 accounts for ~15% of the euphotic zone phosphonate production if 5% of its cellular P is allocated to phosphonates.

Evidence for phosphonylated surface layer structures by Prochlorococcus SB
The phosphonate biosynthesis gene cluster contains genes predicted to be involved in cell wall/membrane/envelope biogenesis and extracellular polysaccharide biosynthetic processes (Fig. S5), suggesting that Prochlorococcus SB may produce extracellular saccharides modified with phosphonate groups. We also found that genes coding for glycosylation reactions were statistically more abundant within ± 5 Kbp of the phosphonate biosynthesis cluster than in the remainder of the complete genomes from Prochlorococcus SB and Pelagibacter sp. HTCC7217 and RS40 (Table S5). Many of these genes encode glycosylation enzymes involved in cell membrane/envelope biogenesis (Fig. S5). However, our chemical data shows that phosphonates are associated with a high molecular weight protein fraction. We propose that, in Prochlorococcus SB, phosphonates serve as molecular decorations of glycan chains bound to high molecular membrane-anchored proteins. In glycoproteins sugar polymer chains are covalently attached to amino acid side-chains of a parent protein through post-translational modification.
Glycoproteins were once thought to be exclusive to eukaryotes, but recent efforts indicate that they are ubiquitous in bacteria and archaea (45). In bacteria glycans are attached to specific target proteins via both N-and O-linked glycosylation (46). Protein substrates for glycosylation reactions include the motor protein flagellins, pilins found in bacterial pilus structures (46), and S-layers which are two-dimensional paracrystalline structures that cover the outer surface of bacteria and archaea (47). Protein glycosylation occurs via two distinct mechanisms: the direct attachment of glycan monomers or small chains to proteins via glycosyltransferases (common in flagellins and adhesins) and the pre assembly of glycan onto lipid carriers followed by transfer en masse to a protein acceptor (48). Recently, it has been discovered that many bacteria also possess general O-linked glycosylation systems capable of promiscuous modification of numerous different types of membrane associated proteins (49)(50)(51).
In Prochlorococcus SB, we find genes with homologies to both S-layer biosynthesis and the general Olinked glycosylation system. S-layer biosynthetic clusters encode genes for both biosynthesis and export of lipopolysaccharide anchors (47) and the biosynthesis and export of an extracellular glycoprotein containing the critical Ca 2+ -binding hemolysin domain (52,53). In the Prochlorococcus SB genome between 1.15 -1.17 Mbp we find genes encoding putative surface antigens of the filamentous hemagglutinin-glycoprotein family, Ca 2+ -binding hemolysin domain proteins, and a type-I secretion system ABC transporter, and other antigen-polysaccharide repeats (Table S2). Most compelling, however, is the presence of general O-linked glycosylation reaction enzymes adjacent to the Prochlorococcus SB, Pelagibacter HTCC7217 and Pelagibacter RS40 phosphonate biosynthesis clusters (Fig. S5). In Prochlorococcus SB we find homologs to known genes involved in extracellular capsule synthesis/export, glycan chain extension reactions, and glycosyl transferase reactions. The presence of these capsular biosynthesis enzymes is consistent with the significant cross-talk between lipopolysaccharide biosynthesis and glycoprotein biosynthesis (48) and the shared glycan substrates and enzymes between both pathways (54). Most importantly we identify an initiating glycotransferase (PgIC) which links the first glycan subunit to a lipid carrier, a flippase (Wzx) which translocates the assembled glycan-bound lipid carrier to the periplasmic face, and an O-oligosaccharyltransferase (PgIL) which is the critical enzyme moving the assembled glycan chain to the final acceptor protein (54,55). The striking mutual exclusivity we observe between phosphonate production and consumption genes in marine genomes may also be related to the molecular mechanisms and subcellular location of the machinery from both pathways. For example, both the catalytic domains of C-P lyase (56) and the assembly of glycans (45) occur on the cytoplasmic side of the bacterial inner membrane. A cell simultaneously expressing both these systems might cannibalize its own phosphonates. Taken together our findings strongly suggest that Prochlorococcus SB has the genomic potential to produce surface-expressed phosphonoglycoproteins.

Genomic data sources:
We used two collections of genomes for the comparative genomics analysis: MARMICRODB (56) (https://zenodo.org/record/3520509) and the Global Ocean Reference Genomes (GORG) Tropics dataset (57). MARMICRODB contains over 18000 archaeal, bacterial, eukaryotic, and viral genomes from predominantly the marine environment, but also terrestrial and host-associated systems. GORG Tropics consists of approximately 13000 single-cell genomes sequenced from 28 samples across the tropical surface ocean. Roughly 6000 genomes from GORG Tropics come from a single sample (GORG BATS) from the Sargasso Sea. For quantitative analysis of the complete GORG dataset we randomly subsampled GORG BATS (Sample SWC-09) genomes to the median genomes per sample (N=241) from the remainder of GORG Tropics.

Homology searches:
We identified phosphonate biosynthesis and catabolism proteins by homology to a collection of hidden Markov models using HMMERv3.1b2 (http://hmmer.org/) and the trusted cutoffs of each individual model. pepM peptide sequences were initially identified using PF13714 (PEP_mutase) and then aligned to the Tigrfam (58) Hidden Markov Model (HMM) TIGR02320 using hmmalign. We identified authentic pepM peptide sequences by strict presence of conserved active site motif "EDK(X)5NS" or if they lacked at most two active site residues but were located adjacent (within 5000 nucleotides upstream/downstream) to a genes coding for ppd (TIGR03297) or mpnS (59)(60)(61)(62). The PF13714 model identified many sequences lacking the PepM active site motif, and with significant similarity to the related isocitrate lyase superfamily PF00463. Sequences passing the PF13714 bitscore cutoffs, but lacking the "EDK(X)5NS" motif were classified as members of the isocitrate lyase superfamily. We identified peptide sequences from authentic mpnS and the related hepDI/hepDII genes using custom HMMs built from alignments of curated sequences identified as containing essential catalytic residues (63). We constructed these alignments using MAFFT v7.273 (64) in E-INS-i iterative refinement mode. We trimmed the alignment to peptide positions 317-792 to isolate the core catalytic residues and used to construct HMMs. We identified mpnS, hepDI, and hepDII sequences in MARMICRODB and GORG-tropics using the resulting HMMs (evalue < 0.05), and aligned the resulting sequences using MAFFT. We manually inspected the alignments to remove sequences lacking the essential catalytic residues and realigned them iteratively after we removed poorly-aligned sequences.
We also identified two sets of ten highly conserved, single-copy families from Prochlorococcus and SAR11 lineages for use in normalization with metagenome profiling. These gene families had the ten lowest dN/dS ratios of the entire pangenome, that is the highest purifying/stabilizing selection, for Prochlorococcus or SAR11, and should thus be highly specific and sensitive for metagenomic quantification. Prochlorococcus and SAR11 pangenomes were characterized and defined using PanX (67). To profile metagenomes for total bacterioplankton, we used fetchMG from the mOTUs2 tool (68). Briefly, we identified ten COG families (COG0012, COG0016, COG0018, COG0172, COG0215, COG0495, COG0525, COG0533, COG0541, COG0552) from all bacterial and archaeal genomes in MARMICRODB. To reduce redundancy, we clustered protein sequences from each COG at 90% sequence identity using MMseqs2 v4e23d (69) and used the 90% clustered sequence representatives in metagenome search databases. The HMM protein sequence profiles, results from these searches, and descriptions of the sequence families are available from https://github.com/slhogle/phosphonates.

Multiple sequence alignments, phylogenetic inference, and topological comparison:
We aligned authentic pepM peptide sequences containing the "EDK(X)5NS" motif to TIGR02320 using hmmalign and trimmed using trimAl v1.4.rev22 (70) with the automated -gappyout option, and we manually inspected the alignments to ensure the veracity of the "EDK(X)5NS" motif. We created the multiphylum genome phylogenies from genome assemblies with confirmed phosphonate biosynthesis pathway or at least one confirmed phosphonate catabolic pathway. We used the GTDB-Tk v1.3.0 pipeline (71) with default settings and the GTDB R05-RS95 database (72) to identify conserved proteins (120 bacterial proteins/122 archaeal proteins) and generate concatenated multi-protein alignments. We filtered alignment columns using the bacterial and archaeal alignment masks from (http://gtdb.ecogenomic.org/downloads). We then removed columns represented by fewer than 50% of all taxa and/or columns with no single amino acid residue occurring at a frequency greater than 25%. We further trimmed the alignments using trimAl with the automated -gappyout option to trim columns based on their gap distribution. The multi-phylum genome phylogenies and the PepM phylogeny were inferred using FastTree v2.1.10 (73) under the GAMMA model of rate heterogeneity and the WAG+ substitution model (74). Support values were determined using 100 non-parametric bootstrap replicates. Both the PepM tree and genome tree were left unrooted. Phylogenies and associated data were visualized using ggtree (75).

Gene enrichment analysis:
We annotated the 16 Prochlorococcus and 22 SAR11/Pelagibacterales genomes containing verified pepM sequences with eggNOG 4.5.1 (76) using eggNOG-Mapper v1.0.3-3-g3e22728 (89). We then collected the resulting KEGG orthology annotations (77) from all genes and tested for enrichment of modules and pathways in the KEGG hierarchy in genes within 10000 nucleotides upstream/downstream of the pepM sequence start and stop coordinates respectively. Significant enrichment of KEGG categories were determined using the hypergeometric test (78) implemented in clusterProfiler v3.8 (79). We excluded genomes from the analysis that had contig/scaffold breaks within 10000 nucleotides upstream/downstream of the pepM sequence.

Identification of pepM sequences within Prochlorococcus genomic islands:
We predicted genomic islands using Hidden Markov Models trained from conserved gene synteny patterns in closed Prochlorococcus genomes as described before (80). Briefly, we assume genomic islands to be contiguous stretches of DNA enriched in flexible genes families (i.e. genes found in a subset of all genomes). We then used previously described genomic islands in six Prochlorococcus genomes (81) to define core, flexible, and inconclusive gene "states" in four distinct hidden markov models, each with two hidden states (island or non-island). We then used empirical gene family frequencies as a proxy for the core/flexible/inconclusive state of each gene and the gene order on each scaffold as input to the Viterbi algorithm for predicting the hidden island state of each gene in all other Prochlorococcus genomes.

Metagenome read classification:
We created Diamond v0.9.22.123 (82) search databases for each of the three ten-gene marker gene sets identified for Prochlorococcus, SAR11, and all bacterioplankton using reference genomes from MARMICRODB. We also created Diamond databases for only Prochlorococcus pepM, only SAR11 pepM, and all bacterial and archaeal pepM peptide sequences identified with the "EDK(X)5NS" catalytic motif from MARMICRODB only. We identified the marker gene sets using a two-tiered search strategy using Diamond in default fast mode. We first searched the metagenomes against reduced marker sets clustered at 70% amino acid identity by MMseqs2. We pulled all reads with hits to these reduced marker sets, then searched the reads against the entire MARMICRODB, and finally retained read mappings with a best scoring match to a MARMICRODB protein from the original marker family. We searched metagenomic reads against the pepM databases using Diamond (mode --more-sensitive) and score cutoffs of 55% amino acid identity and an E-value of 1e-5. We determined these cutoffs empirically to be those that produced the highest F Score (harmonic mean of precision and recall) from mock metagenomes simulated from GORG-tropics (see next section, Fig. S11). Since the E-value is a function of database size it is important to note that the significance of this cutoff is specific to the reference databases here. In the case of ambiguous alignments (i.e. identical best alignment scores to sequences from different taxonomic groups) we classified reads using a probability-weighted random sampling of all taxonomic groups matching the best hits. We derived the probability distributions for random samplings from the taxonomic classification uniquely mapped pepM reads.
To normalize pepM reads across metagenomes, we calculated the length normalized counts (RPKM) of the 10 single copy marker genes from Prochlorococcus, SAR11, and all bacteria and archaea in each metagenome as reads per kilobase marker and then estimated the number of 'genome equivalents' in each metagenome as the median of the 10 marker families. We also calculated the length normalized abundance of pepM from Prochlorococcus, SAR11, and all bacteria and archaea as reads per kilobase from each metagenome. We then divided the length normalized abundance of pepM by the estimated number of genome equivalents for each taxonomic group to estimate the fraction of genomes with phosphonate biosynthesis potential. To ensure robust estimates of normalized abundance, we excluded samples where the median marker gene coverage was less than 100X for all taxonomic groups.

Quantification of pepM and marker genes using short metagenomic reads:
Metagenomic classification of diverse functional genes can present methodological challenges because the amount of discriminative taxonomic signal in these genes may be marginal or confined to specific regions of the gene. For example, pepM appears to be horizontally transferred both within closely related microbial groups like Prochlorococcus and between diverse microbial lineages. In some cases, it may be difficult to taxonomically assign a short metagenome read derived from a pepM gene because the read may map equally well to a pepM sequence from multiple taxonomic lineages. We quantified the skill with which we could classify pepM sequences derived from SAR11, Prochlorococcus, and all bacteria and archaea genomes using mock metagenomes of known composition.
We simulated mock metagenomes from 200 randomly sampled single cell template genomes from the approximately 13000 genomes from GORG-tropics. The 200 template genomes were sampled from all available taxonomic orders but were restricted to those genomes from each order that fell into the top 10% highest estimated completeness from checkM. Genomes estimated to be below 30% completeness were excluded. These criteria resulted in 74 SAR11 and 48 Prochlorococcus template genomes. We simulated 150 basepair paired-end Illumina sequencing reads mock metagenomes using randomreads.sh from BBMap v37.90 (https://sourceforge.net/projects/bbmap/) with 3x coverage, an insert distribution of 155-320 bp and parameters snprate=0.005, insrate=0.0005, delrate=0.0005, and subrate=0.0005. We next included all identified pepM sequences from GORG-tropics to represent the full diversity of pepM sequences in the surface ocean. As a decoy case we also included all GORG genes from the isocitratelyase superfamily, which have superficial similarity to pepM but lack the essential conserved active site motif "EDK(X)5NS." We then cut these authentic and decoy pepM gene sequences with 75 bp upstream and downstream flanking DNA sequence from their parent genomes and simulated sequencing reads using the same parameters for the full genomes but with a sequencing depth of 30x. This step produced sequencing reads derived from 25, 96, and 14 authentic pepM sequences from Prochlorococcus, SAR11, and all remaining bacteria and archaea, respectively. It also generated reads from 49 and 134 isocitrate lyase decoy sequences from bacteria and archaea and SAR11, respectively. The final resulting simulated metagenomic reads were preprocessed and merged as the Tara ocean and GEOTRACES metagenomes.
Because we knew the precise quantity of pepM-derived reads in the simulated metagenome we could then iteratively calculate the true positive rate, true negative rate, positive predictive value, and the F1 score (harmonic mean of the true positive rate and positive predictive value) for different combinations of sequence similarity cutoffs. In this case the positive predictive value is the probability that a metagenomic read classified as pepM is derived from a pepM sequence, the true positive rate is the percentage of true pepM reads correctly identified as such, and the true negative rate is the percentage of false pepM reads correctly identified. We selected the percent identity and E value classification cutoffs that produced the highest F1 score for pepM. Ultimately, we selected an E value cutoff of 1e-5 and a percent identity cutoff of 55%, which resulted in a true positive rate of 95% and a true negative rate of 99% for classifying any bacterioplankton pepM (Fig. S11). For Prochlorococcus pepM and SAR11 pepM sequences, respectively, our classification criteria produced a true positive rate of 96%/60% and a true negative rate of 97%/98%. These performance metrics suggest that, in all cases, we are unlikely to classify superficially related reads as a true pepM sequence. For Prochlorococcus and combined bacteria/archaea it appears that we correctly capture 95-96% of the total pepM sequence diversity in the mock metagenomes. The true positive rate for SAR11 was significantly lower (60%) which suggests that 40% of SAR11 pepM sequences are missed and probably assigned only to the bacteria/archaea domain level. This result is consistent with the high sequence similarity between a handful of pepM genes from SAR11 and other diverse bacterial groups, and suggests that our metagenomic database probably underestimates the contribution of SAR11 to the total inventory of pepM genes in the ocean. For example, the abundance discrepancy between GORG-tropics genomes and the metagenomes for SAR11 pepM is likely explained by the low estimated sensitivity (60%) for our method to SAR11. Assuming we missed 40% of SAR11 pepM sequences, the corrected percentage of SAR11 genomes with pepM estimated from metagenomes is 17% which agrees very well with the independent results from GORG-tropics (18%) ( Table S1). The misclassification of SAR11 pepM to other groups probably also accounts for some of the significant positive correlation between SAR11 and combined bacteria/archaea pepM abundance (Fig. S9). Overall, these classification metrics indicate that we are unlikely to misclassify pepM reads from marine metagenomes and suggest our relative abundance estimates are robust.

Biotic and abiotic data associated with metagenomes:
We obtained and preprocessed the biotic and abiotic variables used in this study as described in detail here: https://doi.org/10.5281/zenodo.3689249 and here: https://doi.org/10.5281/zenodo.3786232. We obtained inorganic phosphate concentrations from the GEOTRACES Intermediate Data Product IDP2017 version 3 (83), specifically from sections GA02 (84,85), GA03, GA10(86), and GP13. We obtained inorganic phosphate and DOP concentrations from the Tara Oceans project (87) (https://doi.pangaea.de/10.1594/PANGAEA.875579). We obtained modeled climatological dissolved organic phosphorus and other variables from the MIT Darwin model (v0.1_llc90, http://darwinproject.mit.edu/) from the Simons Collaborative Marine Atlas project (88) using pycmap v0.1.2 (https://doi.org/10.5281/zenodo.3561147). We generated taxonomic profiles derived from metagenomic reads as described earlier (88) using the MARMICRODB database. As a first order estimate of Prochlorococcus and SAR11 ecotype relative abundance we divided the number of reads mapping to each ecotype by the number of reads mapping in total the Prochlorococcus genus or the family Pelagibacterales for SAR11. Some proportion of reads mapping to highly conserved (core) regions can only be reliably classified at the genus or family level thus our ecotype relative abundance estimates using total read counts are underestimated.

Random Forest Regression:
We used random forest regression, a nonparametric approach, to identify biotic and abiotic factors structuring the distributions of phosphonate producers. We selected random forest regression because the dataset has nearly 45 abiotic/biotic variables, many of which are correlated, and random forest is capable of handling many predictor variables and is robust to covariate co-linearities. We trained one random forest model (n trees = 1000) for each taxonomic group; Prochlorococcus, SAR11, and bacteria plus archaea. Each model was trained using up to 44 abiotic/biotic variables including trace metal and macronutrient data from GEOTRACES and Tara Oceans, modeled climatological means from the MIT Darwin model (http://darwinproject.mit.edu/), and ecotype relative abundances. For the outcome variable (normalized pepM relative abundance) we used a modified splitting rule for tree construction that maximized the log-likelihood of the beta distribution on the interval [0,1] (89). Although our relative abundance measure is not theoretically restricted to the unit interval (values greater than one could exist, e.g. if the pepM copies per genome greatly exceeded one), in practice pepM relative abundance was always bounded [0,1] in our datasets. We performed random forest regression using nested ten-fold cross-validation to prevent data leakage to the validation phase (90). During the cross-validation and tuning phase, we performed feature selection heuristics to find the top 50th percentile of informative variables for Prochlorococcus, SAR11, and combined bacteria/archaea. We reserved 20% of the data for estimating final model performance. The remaining 80% of training data was split into 10 resampled partitions, each with analysis and assessment partitions, to tune and estimate the performance of preprocessing, supervised feature selection, hyperparameter tuning steps. We tuned hyperparameters (mtry, min.node.size), by maximizing the coefficient of determination from correlation (model R 2 ) and minimizing the root mean square error (RMSE). We used CAR Scores (91) for recursive feature elimination to retain only the top 50th percentile of informative variables. We included the feature elimination step to reduce the computation costs and runtime of the feature importance step (see below). Random forest regression was implemented with the package Ranger (92) and the Beta Forest algorithm (89). For Prochlorococcus we ultimately selected 10 model covariates, a minimal node size of 2, and 7 randomly sampled variables at each split resulting in a coefficient of determination (R We determined predictor variable rankings on the final model from the cross validation step using the Boruta heuristic (93). This step allowed us to identify all predictor variables that consistently performed better than chance and to compare the importance of each variable to a reference importance level, i.e. random data. The Boruta algorithm is a robust, statistically grounded feature ranking method that aims to identify all variables in a dataset that contain useful information for model predictive outcome. This approach is useful when the variables driving the performance model are of interest and not necessarily the ultimate predictive outcome itself. Boruta performs feature ranking by comparing all variables with randomized versions of themselves in an iterative framework. In addition to variable ranking based on relative scores, Boruta, therefore, provides a measure of the significance of each variable relative to noise. There were 10 and 15 covariates that showed better predictive utility than randomly shuffled data for Prochlorococcus and SAR11, respectively (Fig. S7). However, most predictive utility was concentrated into the top 5 variables, while the importance of the other lower ranking variables was in most cases only marginally higher than noise. The five most highly predictive variables for Prochlorococcus phosphonate producers (Fig. S7, Fig. S9, Table S2) included inorganic phosphate concentrations, the modeled climatological mean DOP concentrations, the subsurface chlorophyll maximum layer type, and the relative abundance of the Prochorococcus LLIV clade. The subsurface chlorophyll maximum layer types 1-4 were characterized by deepening SCML depths (~ 20 to 115 meters) while types 5 and 6 were characterized by high fluorescence throughout the upper 75 meters. The top five variables for SAR11 were the relative abundance of total phosphonate producers in the sample, the climatological mean nitrate concentrations, SAR11 IV relative abundance, SAR11 clade IIb relative abundance, and dissolved aluminum concentrations (Fig. S7, Fig. S9, Table S2). The integrated bacteria/archaea phosphonate producer model performed best (R 2 combined = 0.94) and the top five variables (SAR11 clade IIb relative abundance, SAR11 pepM relative abundance, archaea relative abundance, SAR11 clade IV relative abundance, Depth) were strongly informative for the random forest prediction when compared to noise. We confirmed these factors for all groups using parametric beta-binomial regression (Table S2) (94).

Beta-Binomial Regression and Generalized Additive Models:
We used the R package corncob (94) and modeled pepM relative abundance directly from pepM read counts and the median read counts to marker gene sets as "successes/total" which is appropriate for the beta-binomial probability distribution. We used the log-odds link function for both relative abundance and the overdispersion parameter. For each taxonomic group (SAR11/Prochlorococcus/bacteria and archaea) we modeled only the top five most important biotic/abiotic variables identified in the Random Forest regression and variable importance steps. We did not include additional model terms because multiple collinearities between covariates and the additional model complexity prevented model convergence in most cases. We estimated the probability for each biotic/abiotic covariate being informative to the overall model by using bootstrapped likelihood ratio tests (N=1000). We estimated 95% confidence intervals for model coefficients and standard errors using 1000 random draws from the beta-binomial distribution. We estimated seasonal effects in time-series metagenomes using Generalized Additive Mixed Models and Linear Mixed-Effect Models implemented through the mgcv v1.8-26 (95) nlme v3.1-148 libraries in R v3.6.2. To decompose any potential seasonal effects, we fit a cyclic spline term to a variable for the day of the year, which we use as a proxy for season, and we fit a global trend term to the cumulative time since sampling onset. We considered a seasonal effect present if the model term for "day of the year" was statistically significant (p < 0.05).

Prochlorococcus cultures under P-replete and P-starved conditions:
To investigate the linkage of pepM with phosphonate production, we grew two HLII strains of Prochlorococcus: Prochlorococcus SB, which has pepM, and the closely related Prochlorococcus MIT9301, which lacks the pepM gene sequence. We grew both strains axenically, under constant light (30 µmol quanta m -2 s -1 ) in artificial seawater medium AMP1 prepared as described before (96), but using 3.75 µM TAPS as a buffer instead of 1 mM HEPES. We first grew Prochlorococcus SB in the standard medium i.e. with an inorganic phosphate concentration of 50 µM (N/P = 16/1) to assess whether or not this strain could produce phosphonate under P-replete exponential growth conditions. We then grew them in media with a decreased the inorganic phosphate concentration to 2.28 µM (N/P = 350/1), which did not limit growth during the exponential phase of the cultures, but did result in a P-starved stationary phase culture as the cells ran out of P relative to other nutrients. We verified P-starvation by the initiation of further growth after adding inorganic phosphate to the culture (Fig. 4 inset). For all experiments, we grew biological duplicates to ensure reproducibility. We assessed culture axenicity by flow cytometry and by confirming a lack of turbidity for at least 30 days after inoculation with three test broths: ProAC (97), MPTB (98) and ProMM (96). We cleaned all the glassware and polycarbonate bottles by soaking overnight in 2% detergent (micro), rinsed 6 times with deionized water, soaked overnight in 1 M HCl and rinsed 6 times with ultra-high purity water.

Cell harvest and treatment:
We harvested Prochlorococcus SB cultures grown in high N/P medium twice: ~6 L during exponential growth and the remainder (14 L) two days after the onset of stationary phase growth. Culture fluorescence increased in the phosphate-amended cultures, reaching levels regularly observed in Prochlorococcus HLII cultures (Fig. 4 inset). We separated the cells from the growth medium by centrifugation (15,970 rcf for 30 minutes at 4°C) and the growth medium was saved for other analyses. We transferred the cell pellets into 50 mL falcon tubes, suspended in Turk Island mix (96) to rinse the cells of external nutrients and centrifuged them (6,523 rcf for 15 minutes at 15°C). We discarded the supernatant and repeated the operation two more times. We flash froze the cells in liquid nitrogen and stored them at -20 o C until NMR analyses. We also measured Prochlorococcus cell abundance by flow cytometry. We prepared and processed the samples as previously described (99,100) and ran on a Guava 12HT flow cytometer (Luminex Corp., Austin, TX, USA). Cells were excited with a blue 488 nm laser analyzed for chlorophyll fluorescence (692/40nm), SYBR Green I stained DNA fluorescence content (530/40nm), and size (forward scatter). We ran the samples stained with 1x SYBR Green I (Invitrogen, Grand Island, NY) and unstained then incubated for 60 min in the dark prior to running. All flow cytometry files were analyzed using Guavacyte.

Potential contribution of 1,3,2-dioxaphospholanes to the 18-27 ppm region of the 31 P NMR spectrum:
Most cyclic phosphate esters, such as 3´,5´-cyclic nucleotide monophosphates have chemical shifts that fall upfield of acyclic phosphate esters and would therefore not interfere with our interpretation of the Prochlorococcus SB NMR spectrum (101). However, some cyclic phosphates, specifically 1,3,2dioxaphospholanes such as 2´, 3´-cyclic nucleotide monophosphates (2´,3´-cNMP), have 31 P chemical shifts that fall within the same spectral region as phosphonates (101,102). 2´,3´-cNMPs are unstable, and are not known to accumulate in cells (102), however to substantiate this we tested two commercially available model compounds for their stability towards mild base hydrolysis; 2´,3´-cyclic adenosine monophosphate and 2´,3´-cytidine monophosphate. Treatment of these cNMPs with 0.1N NaOH for 2 hours completely hydrolyzed the cyclic phosphate as indicated by an upfield shift of the NMR signal from ~19-20 ppm to ~3-4 ppm (Fig. S13). Treatment of Prochlocococcus SB (harvested in stationary phase after exponential growth) with 0.1N NaOH for 2 days, yielded no change in the phosphonate ester/phosphate ester integrals of the spectra (Fig. S13). The pH of the incubation was checked before and after the experiment to ensure basic conditions were maintained throughout. If 2´,3´-cNMPs contributed significantly to the 31 P NMR signal at ~ 22 ppm, we would have expected a decrease in the phosphonate/phosphate ester ratio after hydrolysis. This was not observed. Finally, strong base hydrolysis of the Prochlocococcus SB cells (2N KOH, 85 o C, 24 hr) yielded 2-hydroxyethylphosphonate as the major product, confirming the synthesis of phosphonates by Prochlorococcus SB.
2´,3´-cNMPs are formed by transesterification of nucleotides by the enzymes RNase A (Rnase 1) and binase during RNA depolmerization (103,104). We found no genes encoding RNase A or binase in the Prochlorococcus SB genome. While this analysis does not preclude the formation of 2',3'-cyclic NMPs by other as yet uncharacterized enzymatic pathways, it is consistent with the results of the hydrolysis experiments above that 2´,3´cNMPs do not contribute significantly to the 31 P signal at 22 ppm. Relative abundance estimates are corrected based on the average single cell genome completeness/recovery for each group (see methods). GORG is the Global Ocean Reference Genomes, which is a catalog of single cell genomes isolated from throughout the global tropical surface ocean. GORG-BATS248 is derived from a single 0.    The proportions of phosphonate (Phn) and phosphate (Ph) in the cells and the P/C ratio are used to calculate the Phn-P/C and Ph-P/C ratios. Errors associated the P/C ratio are calculated using the errors associated with the %C and %P in the cells and as: ∆

Errors associated with
Phn-P/C and Ph-P/C ratios are calculated as: ∆ Errors associated with the mean values are calculated using the standard error of the mean.  Figure S1: Genomic potential for phosphonate production and consumption in single-cell genomes from the tropical ocean Flow diagrams representing taxonomic hierarchy of A) phosphonate producers and B) consumers from the GORG-Tropics database. Flows of the same color spanning multiple taxonomic ranks indicate that the higher ranks are entirely subsumed by the lowest rank. For example, all cyanobacterial genomes were from the Prochlorococcus genus. Otherwise each partition at each taxonomic level is colored uniquely. For the quantitative comparison of GORG-Tropics sample BATS248 was randomly subsampled to the median depth of the other 27 samples. Letters at each taxonomic rank refer to key marine archaea/bacterial groups (key left) and bar size is proportional to relative abundance.

Figure S2: Distribution of phosphonate biosynthesis and utilization pathways in bacterial genomes.
Phylogeny is constructed from 120 concatenated, single-copy marker genes from 1890 bacterial genomes from MARMICRODB containing either a phosphonate catabolism or phosphonate biosynthesis pathway. Scale bar is 0.2 amino acid substitution and the tree is unrooted. Monophyletic taxonomic groups with marine representatives are highlighted. The presence of four different phosphonate catabolic pathways (Hpn: Phosphonate catabolism via HpnWXZ, PhnYZ: phosphite or methylphosphonate catabolism via PhnYZ (65), 2AEP: 2-aminoethylphosphonate catabolism via phosphonoacetate, CPLyase: multisubunit C-P lyase system) is displayed in blue on an inner ring while phosphonate/methylphosphonate biosynthesis pathways (pepM and mpnS) are shown in red on the middle ring. The outer summary ring indicates whether a genome contains at least one catabolic pathway (blue), a phosphonate biosynthesis pathway (red), or both (dark grey, < 0.5% of all genomes). For simplification, the MpnS category also includes the functionally related enzyme HepDI (63). For reference, HTCC7217 from the SAR11 clade 1a3 is both a producer and consumer. Figure S3: Similarities between genome, conserved gene families, pepM, and random tree topologies.
Ternary plot displaying similarities between different phylogenies using Mutual Clustering Information (MCI). Phylogenies more similar to the multi-gene concatenated genome phylogeny are positioned vertically towards the 100% shared information vertex. The 'Absent information' axis represents the resolved MCI relative to the theoretical maximum MCI. The Misinformation axis shows the amount of incorrect MCI between trees. The blue point is the comparison between the pepM phylogeny and the genome phylogeny, and the grey point is a comparison between a completely random phylogeny and the genome phylogeny. The green points are comparisons between conserved gene families (not included in the genome phylogeny) and the genome phylogeny. Each gene tree vs. genome tree comparison is presented alongside the MCI expressed as a distance (in brackets), but rescaled such that the expected distance between two completely random trees equals 1.

Figure S4: Distribution of phosphonate biosynthesis gene clusters within Prochlorococcus genomic islands
Includes a subset of all Prochlorococcus genomes with i) high enough completeness and low enough contig fragmentation for island prediction (see methods) and ii) that contain phosphonate biosynthesis genes. Genomes are ordered by phylogeny (left) and colored by ecotype/clade (right). Grey bars denote the location of predicted genomic islands and red vertical lines denote the location of phosphonate biosynthesis clusters within predicted genomic islands. Prochlorococcus SB genome assembly length is shown as a black horizontal line. The assembly lengths of single cell genomes are omitted due to incompleteness and contig fragmentation. Single cell genome contigs and genomic islands are arbitrarily ordered to Prochlorococcus SB.

Figure S6: Phosphonate consumers and producers in 28 GORG-Tropics samples
Same as in Figure 3A, B but analyzing Prochlorococcus and SAR11 separately. Proportions [%] are total producers or consumers divided by the total number GORG assemblies and are corrected using the estimated sequence recovery from assemblies (see methods     When Prochlorococcus SB cells are P-starved, their P/C and Ph-P/C ratios are not measurably different from cells grown under P-replete conditions (left, right), whereas their Phn-P/C ratio (center) increases, indicating an increase in phosphonate content (Table S4). As before, errors associated with the mean values are calculated using the standard error of the mean.    31 P NMR spectra of major biochemical fractions extracted from Prochlorococcus SB. (Top) Carbohydrates extracted with cold aqueous trichloroacetic acid include phosphoglycans that appear at ~ 0 ppm, while no phosphorus was detected in the lipid fraction extracted with aqueous ethanol (Center). As expected, phosphate esters were recovered in the nucelic acid fraction (~ 0 ppm, Bottom). Phosphonate and some phosphate esters appear in the protien fraction (Fig. 4).