Genome-wide association identifies several QTLs controlling cysteine and methionine content in soybean seed including some promising candidate genes

Soybean is an important source of protein, oil and carbohydrates, as well as other beneficial nutrients. A major function of proteins in nutrition is to supply adequate amounts of amino acids. Although they are essential for human nutrition, the sulfur-containing amino acids cysteine (Cys) and methionine (Met) are often limited and the genetic control of their content in soybean seeds is poorly characterized. This study aimed to characterize the phenotypic variation and identify quantitative trait loci (QTL) associated with Cys and Met content in a core set of 137 soybean lines, representative of the genetic diversity among Canadian short-season soybean, spanning maturity groups 000-II (MG000-II). Significant phenotypic differences were found among these lines for Cys, Met and Cys + Met content. Using both a mixed linear model and six multi-locus methods with a catalogue of 2.18 M SNPs, we report a total of nine QTLs and seventeen QTNs of which seven comprise promising candidate genes. This work allowed us to reproducibly detect multiple novel loci associated with sulfur-containing amino acid content. The markers and genes identified in this study may be useful for soybean genetic improvement aiming to increase Cys and Met content.

Soybean (Glycine max (L.) Merr) is a legume species native to East Asia and now widely planted for its edible bean, which plays an essential role in global food security 1 . Although soybeans are classified as an oilseed with 18-23% oil on a moisture-free basis, its increased production worldwide can be attributed also to its high protein content ranging 38-44% 2 . Soybean seed proteins are categorized as either albumins or globulins based on their solubility patterns 3 . The main storage protein is comprised of the globulin fraction, accounting for roughly 70% of the total seed protein. The 7S (β-conglycinin) and 11S (glycinin) polypeptides are the major components of the globulin fraction. This fraction provides 18 amino acids, including all 10 essential amino acids, but the sulfur-containing amino acids cysteine (Cys) and methionine (Met) are not present at adequate levels for optimal growth and development of monogastric animals 4 .
Cysteine and methionine contribute highly to human health and animal nutrition by preventing cancer, the development of cardiovascular diseases and promoting proper function of the immune system 5 . Unlike fat and starch, the body does not store excess amino acids for later use. Therefore, amino acids must be obtained from food daily 6,7 . Considering their role in maintaining human and animal health and their low level in soybean protein (< 3% of all amino acids in seed), developing soybean varieties with improved cysteine and/or methionine content is an important breeding objective to address soybean nutritional limitations 8 . Understanding the precise genetic architecture of these traits and identifying key genes controlling the accumulation of these protein fractions or amino acids would be an important step in this direction. So far, fifteen genes encoding β-conglycinins (CG1-CG15) and five genes (Gy1-Gy5) encoding glycinins have been reported 9,10 (http://www.soyba se.org/). Seed content in sulfur amino acids has been shown to be quantitatively inherited and controlled by multiple genes 3,6 . Only a few studies have reported quantitative trait loci (QTL) associated with cysteine and methionine content in soybean seed (SoyBase, https ://soyba se.org/). Qiu et al. 11 predicted 12 candidate genes based on the

Results
Phenotypic variation and correlations among traits. In view of performing a GWAS, we obtained phenotypic data for the sulfur amino acid content (g/kg TP) on a set of 137 soybean lines by growing single-row plots on two sites (two replicates/site) in 2013. The frequency distributions for Cys, Met and Cys + Met contents are presented in Fig. 1. All of the seed composition traits examined in this study exhibited a normal distribution as the p-value of a shapiro test for the three traits was < 0.05 and appeared to be quantitatively inherited. Descriptive statistics for Cys, Met and Cys + Met contents are presented in Table 1. It can be seen that the amino acid content varied from 13.5 to 15.0 g/kg on total protein basis (TP) for Cys and from 12.6 to 13.6 g/kg TP for Met. The total sulfur amino acid content Cys + Met varied from 26.0 to 28.5 g/kg TP. Across all 137 lines, the means were 14.6, 13.0 and 27.6 g/kg TP for Cys, Met and Cys + Met, respectively. The least significant difference (LSD) between two genotype means was 0.3 for Cys, 0.2 for Met and 0.5 g/kg TP for Cys + Met contents. An analysis of variance showed that the genotype main effect was a much greater source of variation p ≤ 0.001 than the environment and genotype by environment effect, but the latter was found to be a more important source of variation (p ≤ 0.01) for the sulfur amino acid content than the site. A high broad-sense heritability was observed and ranged from 49 to 62%.

Scientific Reports
| (2020) 10:21812 | https://doi.org/10.1038/s41598-020-78907-w www.nature.com/scientificreports/ To assess if these three traits were associated, we measured the correlation between them across the two environments. All three amino acid contents were positively and highly significantly correlated (p < 0.001). The Pearson correlation coefficients between these traits ranged from 0.81 to 0.97 (Table S1, upper diagonal) and the genetic correlation explaining the proportion of variance that two traits shared due to genetic causes ranged from 0.89 to 0.97 (Table S1, bold lower diagonal).
Genotyping and SNP calling. Using the Fast-GBS pipeline for SNP calling, 93% of previous GBS singleend reads (~ 203 million 100-bp Illumina HiSeq2000) were successfully mapped onto the Wm82.a2.v1 reference genome. A total of 56 K high-quality SNPs were obtained and so called «GBS-derived SNP». To obtain an extensive SNP coverage of the genome, we used a two-step approach to genotype the 137 lines of this association panel (Fig. S1). In a first step, we re-analyzed previously obtained reads derived from GBS analysis to take advantage of an improved SNP-calling pipeline and reference genome. Overall, 93% of ~ 203 million 100-bp Illumina HiSeq2000 single-end reads successfully mapped onto the reference genome and 56 K high-quality GBS-derived SNPs were identified using the Fast-GBS pipeline (see "Materials and methods" section). In a second step, to further densify marker coverage, we performed the imputation of close to 4.3 M SNPs by using a set of 102 resequenced lines (of which 56 were common to both sets of lines) as a reference panel. This fully imputed set of 4.3 million SNPs was filtered to remove InDels and to retain markers with a MAF ≥ 0.05 and heterozygosity ≤ 0.1.  Table 1. Descriptive statistics for cysteine, methionine and the sulfur amino acid content across two sites (two replicates per site). LSD at the 0.05 probability level. Var(G) = genotypic variance, Var(E) = environmental variance, Var(GxE) = Genotype by environment variance and H 2 = Broad sense heritability. NS Not significant at α = 0.05. **, ***Significant at the 0.01 and 0.001 probability levels, respectively. Population structure. Based on a posterior Bayesian clustering analysis performed using a subset of pruned markers (r 2 ≤ 0.5), the optimum number of subpopulation (k) was 7 (Fig. 2a). A phylogenetic tree in (Fig. 2b) constructed with the same subset of markers showed seven main branches with bootstrap values ≥ 50%. Similarly, for PCA ( Fig. 2c), the total variance explained by each principal component (PC) decreased from PC1 to PC7 and, after PC7, the variance explained by each further PC remained low and stable. Together, these results suggested that k = 7 provided a good assessment of population structure (illustrated in Fig. 2). These subpopulations tended to contain lines belonging to the same maturity group and/or originating from the same breeding programs and the corresponding Q matrix was used for GWAS.
Genome-wide association scan for sulfur amino acid content. To identify genomic regions associated with the seed content in sulfur amino acids, a genome-wide scan was performed using a MLM and six multi-locus models implemented in the mrMLM package v4.0. A pruned set (r 2 ≥ 0.9) of SNP markers and a single phenotypic value (BLUP) for each trait were used. For the MLM analyses, as can be seen in the Q-Q plots shown in Fig. S3, this method successfully limited the confounding effects of population structure and kinship as the observed p-values only diverged from the diagonal (expected p-values) at the most extreme values (beyond 3E−03 for all three traits). The results of these MLM association analyses are presented as genome-wide Manhattan plots in Fig. 3. With the chosen threshold for false discovery rate (blue horizontal line, FDR ≤ 0.05), only 0.03% of the SNPs (71 of the 243 K) were significantly associated with Cys content, with the uncorrected p-values ranging from 1.37E−05 to 1.82E−09. There were 41 markers (0.02%) significantly associated with Met content (p-value from 8.02E−06 to 1.25E−08) and 71 SNPs (0.03%) with Cys + Met content (p-value from 1.32E−05 to 9.56E−10) as uncorrected p-values.
To identify distinct QTLs, we characterized haplotype blocks formed by these associated SNPs according to Gabriel et al. 27 . For SNPs associated with more than one trait, the most significant SNP (lowest FDR, across all significantly associated traits) was retained to mark a given haplotype block. In total, nine haplotype blocks www.nature.com/scientificreports/ (numbered 1-9 in Fig. 3) were defined of which eight were associated with Cys, four with Met and seven with Cys + Met. In three cases, the same QTL/haplotype block was detected for all three traits (QTLs #4, 5 and 8 on Gm08, 10 and 20) and in four cases, the same QTL was shared between two of the three traits, Cys and Cys + Met (QTLs #2, 6 ,7 and 9 on Gm08, 14, 19 and 20). Only two QTLs (QTLs #1 and 3 on Gm05 and 08) were unique to a single trait (Cys and Met, respectively). The features of these nine QTLs are summarized in Table 2. The portion of phenotypic variance explained (R 2 ) varied between 15 and 27%. The magnitude of allelic effects were calculated according as per Kang et al. 28 and varied between 0.12 and 0.42 g/kg TP. From the multi-locus analyses, ~ 0.01% of the SNPs (36, 36 and 38 of the 243 K) were significantly associated with Cys, Met and Cys + Met content, respectively. The LOD score of these associated SNPs ranged from 3.26 to 17.65 for Cys, 3.05 to 18.98 for Met and 3.04 to 13.15 for Cys + Met content. Across the six multi-locus methods (Fig. 4), pKWmEB detected the fewest associated SNPs (0), while the highest number was detected by pLARmEB (14 SNPs associated with Cys content).
Here, we report QTNs that met the following criteria: (i) the associated SNP was detected by at least two multi-locus methods and (ii) explained at least 5% of the phenotypic variance (R 2 ). In total, seventeen QTNs/ haplotype blocks were identified in this fashion including three QTNs codetected by both MLM and multi-locus methods (Table S2). In most cases, using multi-locus methods, the exact same SNP was the most highly associated SNP with each trait, although occasionally, a different SNP residing in the same haplotype block was most highly associated (e.g. Gm08: 43,871,472 for ISIS EM-BLASSO and Gm08: 43,877,254 for mrMLM). In the latter case, to be conservative, the SNP with the lowest LOD score was reported as the QTN. Of these seventeen QTNs, four (Cys), three (Met) and five (Cys + Met) were unique to a single trait. One QTN was shared by each of two Each dot indicates the degree of association between a single marker and a trait (y-axis) while the x-axis shows the physical position of each marker. A blue horizontal bar indicates the significance threshold (FDR ≤ 0.05) and significantly associated markers are coloured in green. The red numbers (from 1 to 9) identify the different QTLs based on the haplotype blocks formed by the significantly associated markers. The pink vertical stripes highlight QTLs that were common to all three traits and the black arrows indicates the overlapped QTL between MLM and multi-locus methods.

MLM QTL validation.
To validate the QTLs discovered by the MLM method, data from three additional trials were obtained. Given the difference in original purpose and experimental design of this second set of trials, the resulting data were used to assess if the phenotypic mean differed between lines carrying one or the other allele at each QTL (Table S3). For Cys, seven out of the eight QTLs resulted in a significant difference (p < 0.006) in the mean Cys content between lines carrying contrasting alleles in at least one of the three validation trials. Six out of eight QTLs were validated in at least two trials and five QTLs were validated in all three trials. For Met, of the four QTLs discovered initially, all four were validated (p < 0.01) in all trials. For Cys + Met, all seven QTNs were validated (p < 0.007) in at least one trial, six were confirmed in at least two trials and five were validated in all three trials (Fig. 5). Together, these data strongly suggest that the QTLs identified through GWAS are robust and highly reproducible across multiple environments.

Multi-locus QTN validation.
To validate the QTNs discovered by the multi-locus methods, the same approach and phenotypic data used for validation of MLM-derived QTLs was used. With the exception of QTN_#7 in the OT_I_2018 environment, unique QTNs (i.e. not including the three coinciding with QTLs) did not produce a significant phenotypic contrast between lines carrying alternate alleles at these loci (Table S4). Thus, we find that the QTNs discovered using a multi-locus model could seldom be validated using an approach relying on the observation of phenotypic contrasts resulting from considering these loci individually.   Table S5. An example of this approach is illustrated in Fig. 6. Among these haplotype blocks, the haplotype block codetected by MLM and by four of the six multi-locus methods (FASTmrMLM, ISIS EM-BLASSO, mrMLM and pLARmEB) corresponding to QTL #9 (MLM) and QTN #17 (multi-locus) spanned 64 kb and included 12 genes. Among these genes, Glyma.20g129700 was located 45 kb upstream of the MLM peak SNP (Gm20: 37,038,638) and 35 kb upstream of the multi-locus peak SNP (Gm20: 37,049,294). This gene was annotated as being involved in methionine biosynthesis (GO: 0009086). In addition, within haplotypes identified from MLM analysis, only one other candidate gene had an annotation evoking an association with sulfur amino acid content. Glyma.08g367600 is located 10.8 kb downstream of the peak SNP in QTL #4 (Gm08: 47,797,482), encodes a protein that is predicted to have an oxido-reductase activity and its annotation (GO: 0000096) suggests a role in sulfur amino acid metabolism. Five other candidate genes were identified within four haplotype blocks identified only by the multilocus methods and correspond to QTNs #1, #10, #11 and #14 (Table 3). Among them, Glyma.13g108800, Glyma.14g003200 and Glyma.14g003400 were all annotated as being involved in cysteine biosynthetis. Glyma.13g108800 was found among 11 other genes in a 71-kb haplotype block and located 17 kb uptream of the peak SNP corresponding to QTN #10 (Gm13: 22,275,813). The last two candidate genes were found within a haplotype block of 149 kb corresponding to QTN #11 and included 24 other genes. These were located 61 kb and 50 kb downstream of the peak SNP (Gm14: 367,224).
Two other candidate genes, Glyma.04g237300 and Glyma.16g032200, had an annotation indirectly associated with the methionine biosynthesiss. They were located 43 kb and 23 kb uptream of the peak SNPs (QTN #1, Gm04: 50,542,407) and (QTN #14, Gm14: 3,028,024), respectively. Glyma.04g237300 was annotated to have a spermidine synthase activity and found among 21 other genes in a 243-kb haplotype block while Glyma.16g032200 was located in a 56-kb haplotype block containing 6 genes and is annotated as being involved in ethylene biosynthetiss.
We then investigated the expression profile of these seven candidate genes. Based on previous transcriptomic analyses, these genes (except Glyma.08g367600) were known to be mainly expressed in young leaves, flowers and roots but also expressed in pods 7 days after flowering (DAF), pod shells (10-13 DAF) and in seeds (14-17 and 35 DAF) as illustrated in Figs. 7 and S4. There were no transcriptomic data available yet for the expression profile of Glyma.08g367600, likely due to the fact that it is present only in the current Wm82.a2 assembly/annotation. Structural and nucleotide variation within candidate genes and their predicted functional impact. Given the availability of WGS data for 56 lines of our association panel, we looked for structural or nucleotide variation within the eight candidate genes identified previously. No structural variants > 51 bp were identified in any of the candidate genes within these 56 lines. We also examined if SNP variants within coding regions could impact gene function. In total, 49 nucleotide variants were identified within coding regions of  www.nature.com/scientificreports/ seven candidate genes. All of these variants were predicted as having a "low" to "modifier" effect. As such, we do not consider these variants to be likely causal.

Discussion
Phenotypic variation and correlations among traits. Normal distributions were observed for all the traits (Shapiro test p-value < 0.05) and suggested that these traits are controlled by polygenes. The phenotypic variation ranged from 13.5 to 15.0 g/kg for Cys and from 12.6 to 13.6 g/kg for Met content. The total sulfur amino acid content Cys + Met varied from 26.0 to 28.5 g/kg on the basis of total protein. The observed phenotypic values reported in this study were slightly lower but similar in range to those reported in previous studies when similarly expressed in terms of g of amino acid per kg of total protein. Warrington et al. 3 reported 14.8 to 16 g/kg and 13.9 to 14.7 g/kg for Cys and Met content. Song et al. 19 reported 12.1 to 14.5 g/kg and 11.5 to 13.7 g kg respectively for Cys and Met contents. However, in terms of range, the results reported here correspond most closely to those reported by 3 (1.4 and 0.9 g/kg, respectively, for cysteine and methionine content). Moreover, similarly to previous studies, high and positive correlations (0.81 to 0.97) were seen between cysteine, methionine and the total sulfur amino acid contents. It has been suggested that the high phenotypic and/or genotypic correlations between cysteine and methionine contents are due to a shared biosynthetic pathway or a similar molecular basis of inheritance 6,30 .

Genome-wide association analysis.
For marker-trait association analysis, a total of 243 K SNP markers and a single phenotypic value for each line (BLUP) was used in two GWAS approaches (one single-locus (MLM) and six multi-locus (FASTmrEMMA, FASTmrMLM, ISIS EM-BLASSO, mrMLM; pLARmEB, and pKWmEB) methods). In all cases, we incorporated population structure and family relatedness to control for confounding effects. We reported a total of nine QTLs/haplotype blocks associated with the sulfur amino acid content detected by an MLM approach and seventeen major QTNs detected by at least two of the six multi-locus methods. Interestingly, three chromosomal regions were detected using both approaches. Among MLM-derived QTLs, one third of these SNPs were significantly associated with all three traits. Importantly, approximately two thirds of these QTLs proved extremely robust as they could be successfully confirmed in three additional trials. This suggests that these individual markers would provide a good tool for breeders interested in using marker-assisted selection (MAS) to improve Cys and/or Met content within this germplasm. In contrast, however, the QTNs uncovered in the multi-locus analysis could not be validated in the same fashion. One must keep in mind though that these QTNs were discovered in a model based on the joint consideration of these loci whereas our validation experiment was based on the consideration of loci individually. This may have contributed to our inability to confirm the impact of the QTNs on these traits. Nonetheless, in the context of MAS, it is likely that these QTN would not prove useful; these might be better captured in a genomicselection scheme relying on a wider set of loci and allowing for multi-locus contributions to the phenotype. Expression strength coded by color: yellow = low, red = high. As shown, Glyma.20g129700 is most highly expressed in flowers, pods, roots, pod shells and in the seed. Data derived from RNA-seq of Glycine max, published by Severin et al. 21 and download from eFP browser 29 (www.bar.utoro nto.ca).

Scientific Reports
| (2020) 10:21812 | https://doi.org/10.1038/s41598-020-78907-w www.nature.com/scientificreports/ In previous mapping work, the total number of QTLs associated with cysteine and methionine content varied from 1 31 to 14 17 in GWAS studies while 2 14 and 8 QTLs 32 were reported following family-based mapping. As we observed here, a subset of QTLs were common to both cysteine and methionine content 6,7,31 and, in one case, all the QTLs (eight) were associated with both cysteine and methionine content 32 . This is unsurprising given the high degree of correlation between these traits and the fact that both amino acids share the same biosynthetic pathway with cysteine being a precursor of methionine production 33 .
To compare the genomic regions reported by previous studies with the QTLs reported in this study, we used SoyBase to define the physical position of previously reported QTLs. Two of the three QTL/QTN codetected by both approaches in this study fell within genomic intervals identified in previous biparental mapping work. Indeed, Wang et al. 32 reported one QTL Cys 3-7 located from 3.0 to 35.9 Mb on chromosome 20 associated with seed cysteine content. Moreover, this genomic region fell within QTL qMet_Gm20 and qCys + Met_Gm20 None of the seven other QTLs that we reported from MLM analysis in this study coincided with previously reported QTLs intervals identified in family-based mapping. Similarly, none of the QTLs reported by previous GWAS studies 17,31,35 were located within 500 kb of those reported in this study. Except for the work of Zhang et al. 31 in which only one QTL was reported, all other GWAS studies have been conducted within later maturity groups (MG I-IX) and the limited overlap in QTLs could be attributed to differences in the loci controlling these traits within the different maturity groups.
Considering that sulfur amino acids are components of protein fractions, notably the glycinin (11S) and conglycinin (7S) storage proteins, one might expect that genomic regions reported as significantly associated with sulfur amino acid content also be associated with the abundance of 11S and 7S, which together constitute nearly 70% of total storage proteins in the soybean seed 12 . Interestingly, Ma et al. 36 reported two QTLs significantly associated with glycinin and beta-conglycinin content on chromosome 10 (174 kb and 2.7 Mb) and 20 (34.6 to 37.8 Mb) a genomic region spanning QTL #5 and #9 reported in this study (Gm10 at 1.8 Mb) and (Gm20 at 37.0 Mb) significantly associated with all three traits.
For the multi-locus analyses, we reported a total of seventeen QTNs across twelve chromosomes including three shared regions with the MLM method. Across the traits, three QTNs were detected for all three traits. Two QTNs were shared by two trait pairs (one each for Cys/Met and Met/Cys + Met content). Four (Cys), three (Met) and five (Cys + Met) were unique to a single trait. Similar to the QTLs discovered via MLM, very little overlap was observed between previously reported QTLs and the multi-locus QTNs identified in this study. Among the seventeen QTNs, QTN #10 located on chromosome 13 (Gm13, 22.2 Mb) coincided with one QTL reported by Qiu et al. 11 to be significantly associated with cysteine content (flanked by BARC-042289-08234 and BARC-045205-08910).
Four of the QTNs reported in this study (QTN #1 and #2 at 48.4 and 50.5 Mb on Gm04; QTN #14 at 3.0 Mb on Gm16 and QTN #17 at 37.0 Mb on Gm20) were significantly associated with all three traits and these coincide with three broad genomic regions reported by Ma et al. 36 that were significantly associated with seed glycinin, beta-conglycinin and seed glycinin to beta-conglycinin ratio content across three chromosomes (Gm04: 7.6 to 51.2 Mb, Gm16: 569 kb to 29.5 Mb and Gm20: 34.6 to 37.8 Mb).
Candidate genes and their functions for sulfur amino acid accumulation. The peak SNPs associated with sulfur amino acid content reside in the nine and seventeen haplotype blocks identified through MLM and multi-locus analysis within which one would expect to find the causal genes that control Cys and Met content in this collection of soybean lines. Of the genes located within these haplotype blocks, seven stood out as promising candidate genes based on their GO annotations.
Of these seven candidate genes, only Glyma.20g129700 was identified using both MLM and multi-locus analysis and encodes a 5-methylthioribose kinase protein. This protein was demonstrated to be a primary enzyme involved in the recycling of the methylthio group of 5-methylthioribose back into methionine 37,38 .
A second promising candidate gene identified using MLM, Glyma.08g367600, encodes a protein that is predicted to have an oxido-reductase activity acting on NAD(P)H. Although the annotation for this gene indicates a role in sulfur amino acid metabolism, we have not been able to pinpoint the precise role of the protein encoded by this gene in sulfur amino acid metabolism. Moreover, no information from previous studies or transcriptomic data was available for this gene. Its role in the metabolism of sulfur amino acids needs to be confirmed.
Among the five other candidate genes identified from multi-locus analysis, Glyma.04g237300 was annotated as a spermidine synthase and said to be involved in methionine synthesis in plants. From the work of Qiu et al. 11 , the spermidine synthase encoding gene was found in association with eleven other genes to contribute 6-38.5% of the genetic variation of sulfur-containing amino acids.
Glyma.13g108800 encode a stress-responsive alpha-beta barrel protein and is thought to be involved in both cysteine and methionine synthesis under stress conditions. Cysteine-containing molecules like glutathione and thioredoxin play a major role in maintaining an intracellular reducing environment and protecting the cell from oxidative damage 39 .
Glyma.16g032200 encodes a protein that is involved in ethylene biosynthetis and in the l-methionine salvage cycle II (in plants www.nature.com/scientificreports/ an ACC synthase 1. In plants, the methionine is a precursor of ethylene synthesis. Therefore, this biosynthetic pathway is directly connected to the methionine cycle 40,41 . Although Glyma.14g003200 and Glyma.14g003400 were both annotated as being involved in cysteine biosynthesis, again, no direct link was found between these genes and cystheine biosynthesis through previous studies.
As none of the SNPs located inside the coding region of our candidate genes were predicted to cause a functional change, it is likely that the alleles of this gene differ in expression, thus resulting in the observed differences in sulfur-containing amino acids. Taken together, these data strongly suggest that the phenotypic variation observed in this collection of Canadian lines is due to variants located outside the coding region that likely result in differences in gene expression.

Materials and methods
Plants material and field trials. A set of 137 lines thought to constitute a core set representative of the extent of genetic diversity among soybean varieties belonging to maturity group 000-II ( MG000-II) in Canada was selected from a larger group of 304 accessions based on the analysis of population structure as previously described in 42  Phenotyping and statistical analysis. For sulfur amino acid content for both mapping and validation population, 25-g samples (from each replicate) of whole seeds were analyzed using near-infrared spectroscopy (Perten NIR DA 7250, Perten Instruments). The NIR used was calibrated according to manufacturer's calibration module based on set of 2646 samples. After sample scanning, the concentration of sulfur amino acids was expressed on the basis of the dry matter percentage. Each value was then normalized per kg of total protein (g/kg TP). Three traits were considered in the ensuing analyses: cysteine content (Cys), methionine content (Met) and the sum of both (Cys + Met). Statistical analyses on amino acid content data were performed using META-R 43 . Descriptive statistical analysis and a mixed linear model were used to determine genotypic variance, environment and genotype × environment effects as well as a Shapiro-Wilk test for normal distribution of the phenotype. The two sites were considered as two environments and all effects were assumed random, the broad-sense heritability for a given trait across the two environments was calculated as follows: where σ 2 g and σ 2 ge are the genotype and genotype × environment interaction variance component, nEnv is the number of environments,σ 2 e is error variance components nrep is the number of replicates. Restricted maximum likelihood was used to estimate the BLUP to provide a single value for each variety. Pearson's pairwise and genetic correlations among the traits were estimated using META-R, the genetic correlation was calculated using equations from Cooper and Delacy 44 where σ g jj ′ is the arithmetic mean of all pairwise genotypic covariances between trait j and j′ , σ g(j) σ g(j ′ ) is the arithmetic average of all pairwise geometric means among the genotypic variance components of the traits.
Genotyping and quality control. A total of ~ 203 million 100-bp Illumina HiSeq2000 single-end reads derived from sequencing 192-plex GBS libraries were available for the 137 lines as described previously by 42 . Briefly, DNA extraction was performed using the Qiagen DNeasy 96 Plant kit, The ApeK1 restriction enzyme was used for library preparation as per 45 . These archived reads were processed anew to benefit from an improved SNP-calling pipeline (Fast-GBS 46 ) and a more recent annotation of the soybean genome Wm82.a2.v1 47 . Genotypes were called using a minimal read depth of two reads and loci with more than 80% missing data were removed. Imputation was first performed on this catalogue of 56 K GBS-derived SNPs using BEAGLE v5 48 to fill in any missing genotypes. We then used a set of 4.3 million SNPs derived from the whole-genome sequencing of 102 Canadian elite soybean lines 49 as a reference panel to impute all missing loci onto the initial catalogue of GBS-derived SNPs, again using BEAGLE (Fig. S1). Among these 102 resequenced lines, 56 were in common with the association panel described above (i.e. > 40% overlap). The accuracy of imputation of such untyped loci was previously assessed 50 and found to be 96.4%. After imputation of missing loci, VCFtools 51 was used to retain SNPs with a minor allele frequency (MAF) ≥ 0.05 and heterozygosity ≤ 0.1, thus producing a catalogue of 2.18 M SNPs.
Population structure. The resulting marker dataset was pruned by excluding SNPs in moderate linkage disequilibrium (LD) (r 2 > 0.5) using PLINK 52 to obtain a set of 14 K markers. Population structure was characterized using fastSTRU CTU RE 53 with the number of subpopulations (K) being set from 1 to 12 with 3 independ-H 2 = σ 2 g σ 2 g + σ 2 ge /nEnv + σ 2 e /(nEnv × nrep) , ρ g = σ g(jj ′ ) σ g(j) σ g(j ′ ) , Scientific Reports | (2020) 10:21812 | https://doi.org/10.1038/s41598-020-78907-w www.nature.com/scientificreports/ ent runs of each. The most likely K value was determined using a python script ("choosek") implemented in fastSTRU CTU RE based on the rate of change in LnP between successive K values. In order to better support the number of sub-populations, we used two additional tools: (i) a bootstrap consensus phylogenetic tree (2000 replicates) was constructed using the maximum likelihood method based on the Tamura-Nei model implemented in MEGA7 54 ; and (ii) a principal component analysis (PCA) was performed using GAPIT 55 .
Genome-wide association analysis. Genome-wide association between markers and the phenotypes was assessed in GAPIT and mrMLM R package v4.0 using a catalogue of 243 K SNPs and the best linear unbiased predictors (BLUP) values for each trait. Two GWAS approaches included one single-locus methods (MLM) and six multi-locus (mrMLM; FASTmrMLM, pLARmEB, ISIS EM-BLASSO, FASTmrEMMA and pKWmEB) methods were used. For significant SNP markers detection, a significant threshold, FDR ≤ 0.05 and LOD ≥ 3 were used, respectively for MLM and multi-locus methods. To account for population structure and genetic relatedness between the lines, we used the Q matrix obtained from fastSTRU CTU RE and a K matrix (estimated using the EMMA algorithm 28 ) as covariates. To distinct different QTL/QTN, significantly associated SNPs from the same genomic region were grouped together to form haplotype blocks, as per the LD-based method 27 , using the complete 2.18 M SNP catalogue. To be chosen as a haplotype block of interest, a block had to contain at least one peak SNP significantly associated with sulfur amino acid content. The extent of each block was used to retrieve all genes and their annotation.
Candidate genes and their functional analysis. Only genes residing within haplotype blocks containing a peak SNP were extracted from SoyBase and were examined for their GO annotations. In order to provide more information about potential candidate genes, the electronic fluorescent pictograph (eFP) Browser (www. bar.utoro nto.ca) for soybeans was used to identify in what tissues and at which developmental stages candidate genes were expressed (based on the transcriptomic data of Severin et al. 56 . We also looked for potential lossof-function (LOF) alleles among the list of candidate genes by inspecting the catalogue of structural variants reported by Torkamaneh et al. 57 and by examining the predicted impact of the full set of nucleotide variants located within genic regions using SnpEff 58 . The search for such LOF alleles was limited to the 56 lines for which WGS data were available.