Proc Natl Acad Sci U S A. Dec 21, 2010; 107(51): 22032–22037.
Published online Dec 3, 2010. doi:  10.1073/pnas.1009526107
PMCID: PMC3009785
From the Cover
Agricultural Sciences

Whole-genome sequencing and intensive analysis of the undomesticated soybean (Glycine soja Sieb. and Zucc.) genome


The genome of soybean (Glycine max), a commercially important crop, has recently been sequenced and is one of six crop species to have been sequenced. Here we report the genome sequence of G. soja, the undomesticated ancestor of G. max (in particular, G. soja var. IT182932). The 48.8-Gb Illumina Genome Analyzer (Illumina-GA) short DNA reads were aligned to the G. max reference genome and a consensus was determined for G. soja. This consensus sequence spanned 915.4 Mb, representing a coverage of 97.65% of the G. max published genome sequence and an average mapping depth of 43-fold. The nucleotide sequence of the G. soja genome, which contains 2.5 Mb of substituted bases and 406 kb of small insertions/deletions relative to G. max, is ~0.31% different from that of G. max. In addition to the mapped 915.4-Mb consensus sequence, 32.4 Mb of large deletions and 8.3 Mb of novel sequence contigs in the G. soja genome were also detected. Nucleotide variants of G. soja versus G. max confirmed by Roche Genome Sequencer FLX sequencing showed a 99.99% concordance in single-nucleotide polymorphism and a 98.82% agreement in insertion/deletion calls on Illumina-GA reads. Data presented in this study suggest that the G. soja/G. max complex may be at least 0.27 million y old, appearing before the relatively recent event of domestication (6,000~9,000 y ago). This suggests that soybean domestication is complicated and that more in-depth study of population genetics is needed. In any case, genome comparison of domesticated and undomesticated forms of soybean can facilitate its improvement.

Keywords: massively parallel sequencing, sequence variation, wild soybean, divergence, genome duplication

Wild soybean (Glycine soja Sieb. and Zucc.) and cultivated soybean [G. max (L.) Merr.] belong to the subgenus Soja within the genus Glycine Willd. of the family Leguminosae, which includes alfalfa (Medicago sativa), pea (Pisum sativum), common bean (Phaseolus vulgaris), peanut (Arachis hypogaea), and lentil (Lens culinaris) (1). G. soja is generally considered to be the closest wild relative of G. max (1). G. soja and G. max both have 20 chromosome pairs (2n = 40), hybridize easily, exhibit normal meiotic chromosome pairing, and generate viable fertile hybrids (1). However, wild and cultivated soybeans differ with respect to several plant morphological characteristics: Wild soybean grows in the form of creepers with many lateral branches, flowers later than cultivated soybean, produces tiny black seeds rather than large yellow seeds, and has pods that shatter easily, promoting long-distance seed dispersal (2). These phenotypic differences between G. max and G. soja, defined as domestication-related syndromes, have not been genetically dissected to identify their underlying genetic components, although dozens of genomic regions for relevant traits have been reported in studies involving quantitative trait locus mapping (2–4).

The genomes of more than 10 plant species have been sequenced so far (5). Among them, 6 species, rice (Oryza sativa) (6), grapevine (Vitis vinifera) (7), sorghum (Sorghum bicolor) (8), cucumber (Cucumis sativus) (9), maize (Zea mays) (10), and soybean (G. max) (11), are commercial crop plants. These genome sequences have provided powerful tools to characterize the genes responsible for valuable traits in crops and related species. However, the genome sequence of a single crop strain does not allow an understanding of the origins of separate genes involved in complex traits. Furthermore, it is challenging to decipher the processes involved in crop domestication from germination to flowering. Thus, the genome sequences of wild species should provide key information about the genetic elements involved in speciation and domestication. Compared with G. max, limited DNA sequence resources are available for G. soja, which had only 18,808 expressed sequence tags and 2,443 transcript assemblies in GenBank as of July 2010.

Recently, next-generation sequencing platforms have enabled the generation of several orders of magnitude more sequence data within a relatively short time than the traditional Sanger method (12, 13). Among them, two massively parallel sequencing (MPS) platforms, Illumina Genome Analyzer (Illumina-GA) and Roche Genome Sequencer FLX (GS-FLX), using reversible terminators and pyrosequencing, respectively, are in widespread use for genomic, biological, and medical studies (12, 13). Using these MPS technologies, whole-genome resequencing was performed to complete genome sequencing of several species including Caenorhabditis elegans (14), humans (15–19), and Arabidopsis (20). In humans, five genome sequences in addition to the HuRef sequence have been reported for different ethnic groups, and they have enabled the identification of nucleotide differences putatively underlying complex traits. Moreover, the Drosophila Genetic Reference Panel, the human 1000 Genomes Project, and the Arabidopsis 1001 Genomes Project are in progress with goals such as the development of patient-tailored medicine and comprehensive understanding of natural variation (21–23).

In this study, we sequenced the whole genome of wild soybean (G. soja) using two MPS platforms, Illumina-GA and GS-FLX, and analyzed G. soja genome sequences in detail to catalog the wealth of genomic variation between G. max and G. soja. This extensive analysis offers a primary glimpse of soybean domestication history, which suggests that divergence of G. soja and G. max predated soybean domestication.


Sequence Generation and Alignment.

G. soja var. IT182932 genomic reads generated by MPS were assembled using a draft G. max genome sequence (Glyma1.01, 937 Mb excluding gaps; http://www.phytozome.net/soybean.php) (11) as a reference. The single- and paired-end DNA sequences of 48.8 Gb produced by Illumina-GA yielded a 52.07-fold sequence coverage (SI Appendix, Table S1). Using Mapping and Assembly with Quality (MAQ) software (http://maq.sourceforge.net/index.shtml), 90.91% of ~782 million short reads (36 or 75 bp) were mapped to Glyma1.01. The final mapped sequence length of 915.4 Mb covered 97.65% of G. max, a 43.03-fold mapping depth. To validate SNPs/indels and large deletions detected in the MAQ alignments, 3.36 Gb of longer DNA reads (on average 250 or 400 bp) generated by GS-FLX were aligned using GS Reference Mapper software (ver. 2.0; Roche) (SI Appendix, Table S1). About 98.30% of the 10.9 million GS-FLX reads were mapped to Glyma1.01, resulting in 57.0% coverage at an average depth of 3.45-fold.
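The fold-coverage figures above follow from dividing the number of sequenced bases by the length of the reference. A minimal sketch of that arithmetic, using the rounded values quoted in the text (the function name is illustrative and is not part of the study's pipeline):

```python
def fold_coverage(total_bases: float, genome_size: float) -> float:
    """Average sequencing depth: sequenced bases divided by reference length."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# Rounded figures from the text: 48.8 Gb of Illumina-GA reads,
# 937 Mb of gap-free Glyma1.01 reference sequence.
illumina_depth = fold_coverage(48.8e9, 937e6)
print(f"Illumina-GA sequence coverage: {illumina_depth:.1f}-fold")
```

With these rounded inputs the result is about 52-fold, in line with the reported 52.07-fold sequence coverage; the lower 43.03-fold mapping depth reflects only the reads that were actually placed on the reference.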

Prediction of Duplicated Regions in the G. soja Genome.

Aligning the short but redundant DNA reads against the reference enabled the prediction of duplicated genomic regions by examining regions with higher than average coverage (Fig. 1A) (14, 19). For Illumina-GA sequencing, it was also necessary to compensate for the physical bias toward GC content (SI Appendix, Fig. S1). The estimation of overrepresented regions by copy number revealed that about 80% of the G. soja genome was duplicated (Fig. 1B). This high level of duplication was expected, as the soybean genome has undergone two polyploidy events, 14 and 59 million years ago (Mya) (13). The putatively duplicated genomic regions were as large as 534 Mb (62.82% of the entire genome), and 112 Mb (13.18%) and 22 Mb (2.59%) of the genomic regions exhibited three- and fourfold redundancy, respectively (Fig. 1B).

Fig. 1.
Prediction of duplicated regions of G. soja using read coverage and validation of large deletions by GS-FLX reads. (A) Identification of duplicated regions on chromosome 1 by higher-than-expected coverage compared with expected coverage. Black and red ...

SNP and Indel Identification.

About 2.5 million SNPs between G. max and G. soja were predicted (supported by more than four reads) via multiple stringent filtering criteria; 2,167,728 (86.5%) and 337,257 (13.5%) SNPs were located in nongenic and genic regions, respectively (Table 1). One kilobase upstream of transcription start sites in 33,135 genes harbored 117,394 SNPs. Noncoding genic regions contained 251,021 SNPs (74.43% of the SNPs in genic regions) with 27,409 in 5′ and 3′ untranslated regions and 223,612 in introns. A total of 86,236 SNPs (25.57% of the SNPs in the genic regions), consisting of 47,638 synonymous and 38,598 nonsynonymous SNPs, were classified as coding sequence variants. These findings indicated that 35.6% (16,519 out of 46,430) of the high-confidence gene set are affected by nonsynonymous SNPs in the G. soja genome. The frequency of SNPs in both wild and cultivated soybeans was 2.67 SNPs per 1 kb, consistent with previous studies of G. max varieties (24–26). The G. soja genome contained 196,356 indels (−35 to +14 bp) compared with Glyma1.01, and 78.54% (154,211) of the indels were located in nongenic regions (Table 1). Single-base-pair indels were the most frequent (63.5%, 124,605) type (SI Appendix, Fig. S2). Only 21.46% (42,145) of the indels were positioned within genic boundaries (Table 1). The 2,398 indels in coding sequences caused frameshifts in 2,235 genes, which is a reasonable number of frameshift variants for the soybean genome in comparison with a previous study in Arabidopsis (20). Further examination revealed that indels were located throughout the G. soja genome at a density of 1 indel per 4.8 kb.

Table 1.
Summary of differences between the G. soja and G. max genomes

Functional Prediction of Nonsynonymous SNPs.

The nonsynonymous SNPs were subdivided using PolyPhen (Polymorphism Phenotyping) (27) and SIFT (Sorting Intolerant from Tolerant) (28) predictions on their effects on protein function and structure (SI Appendix, Table S2). Overall, about 20% of the nonsynonymous SNPs were predicted to be functionally relevant. The 10,325 nonsynonymous SNPs in 9,082 genes were predicted to be nonconservative missense (SI Appendix, Table S3). We also identified 1,945 nonsynonymous SNPs positioned in the start or stop codon of 1,791 G. max genes. Considering the G. soja genome as a reference, 138 SNPs created new stop codons in G. max, which potentially resulted in functional defects in 132 genes (SI Appendix, Table S3). Meanwhile, relative to G. soja, G. max gained 1,146 novel start codons and 661 disrupted stop codons that may result in the acquisition of additional coding sequences.
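The coding-variant categories used above (synonymous, nonsynonymous, new or disrupted stop codons, altered start codons) can be illustrated with the standard genetic code. The following is a minimal sketch, not the annotation procedure used in the study; the function name, the category labels, and the `is_start_codon` flag are assumptions made for illustration, and PolyPhen or SIFT scoring is not reproduced.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon: str, alt_codon: str,
                          is_start_codon: bool = False) -> str:
    """Label the effect of replacing ref_codon with alt_codon."""
    ref_codon, alt_codon = ref_codon.upper(), alt_codon.upper()
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    # A change in the annotated initiator ATG removes the start codon.
    if is_start_codon and ref_codon == "ATG" and alt_codon != "ATG":
        return "start_lost"
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop_gained"
    if ref_aa == "*":
        return "stop_lost"
    return "missense"


for ref, alt in [("TTA", "TTG"), ("GAA", "GAT"), ("TGG", "TGA"), ("TAA", "CAA")]:
    print(ref, "->", alt, classify_codon_change(ref, alt))
```

Which genome is treated as the reference determines the direction of each label; the text above takes G. soja as the reference when counting stop codons created in G. max.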

Structural Variation Identification.

MAQ paired-end alignment data were used to assess structural variation between G. max and G. soja. We detected 5,794 deletions and 194 inversions in the range of 0.1~100 kb and predicted the presence of 8,554 insertions in the G. soja genome (Table 1; SI Appendix, Tables S4–S6, Dataset S1). Deletions in the range of 100–500 bp were enriched relative to larger sizes (SI Appendix, Fig. S3), which is similar to MPS findings in humans (15). This enrichment likely occurred, in part, because of sample-size bias of the short reads. About 48.11% of the deleted regions in the G. soja genome contained repetitive elements including the most abundant classes of retrotransposons (20.74%) (SI Appendix, Table S7). In the range of 1~2 kb and 10~20 kb, about 50% of the deletion events involved retrotransposons (SI Appendix, Fig. S3). The 32-Mb fragments present in G. max but absent in G. soja harbor 712 coding sequences in 555 deletion events (Table 1), indicating an increasing copy number of the existing genes and/or novel genes in cultivated G. max. A few examples of gene structure collapse from deletion or inversion events in G. soja are shown in SI Appendix, Fig. S4. Despite the detection of a number of structural variations between G. max and G. soja in this study, this resequencing strategy did not enable prediction of chromosomal translocation between the two genomes (29). Reciprocal translocation between chromosomes (Chrs) 11 and 13 was found in two wild soybeans. However, this translocated sequence, which is located on Chr 13 in G. max var. Williams 82, was mapped on Chr 13 of the G. soja genome. This indicated a limitation of the resequencing strategy for detecting chromosomal translocations.

Validation of SNPs/Indels and Large Deletions by GS-FLX.

To evaluate the consistency rate of MAQ-called SNPs and indels, SNP/indel prediction was performed with GS-FLX-mapped long sequences using GS Reference Mapper software. The 328,680 GS-FLX SNPs were identified in the same positions as MAQ-called SNPs (SI Appendix, Table S8). Of the shared SNPs, 328,643 were concordant, with a level of consistency of 99.99%. In the 37 discordant SNPs, it was ascertained which SNP calls were accurate by dideoxy sequencing. Fifteen sequence-specific primers for these loci were designed (SI Appendix, Table S9), whereas primers for the remaining 22 loci could not be designed because the loci contained repetitive sequences or gaps. Among the tested loci, Illumina-GA calls of four loci were concordant with Applied Biosystems Sanger calls. GS-FLX SNP calls for 11 loci were concordant with Applied Biosystems Sanger calls. Of the 4,677 indels (size 2~4 bp) detected by both programs, 98.82% (4,622) were consistent. With 4,460 GS-FLX long reads, 2,187 large deletions spanning 13.5 Mb were also confirmed by investigating the 50-bp overlaps neighboring the deletion site (Fig. 2; SI Appendix, Fig. S5 and Table S4, Dataset S1), indicating high prediction fidelity for structural deletion variants.
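The concordance figure above is the fraction of sites called by both platforms at which the two calls agree. A minimal sketch of that comparison on toy data follows; the site coordinates and alleles are invented for illustration, and the function is not taken from MAQ or GS Reference Mapper.

```python
def concordance(calls_a: dict, calls_b: dict):
    """Return (shared sites, agreeing sites, percent agreement).

    Each argument maps a site, here (chromosome, position), to the called allele.
    """
    shared = calls_a.keys() & calls_b.keys()
    if not shared:
        return 0, 0, float("nan")
    agree = sum(1 for site in shared if calls_a[site] == calls_b[site])
    return len(shared), agree, 100.0 * agree / len(shared)


# Hypothetical calls from two platforms (not data from the study).
illumina = {("Gm01", 1050): "A", ("Gm01", 2210): "T", ("Gm02", 77): "G"}
gs_flx = {("Gm01", 1050): "A", ("Gm01", 2210): "C", ("Gm03", 5): "T"}

shared, agree, pct = concordance(illumina, gs_flx)
print(shared, agree, pct)
```

Applying the same ratio to the counts reported above, 328,643 concordant calls out of 328,680 shared SNPs, gives the stated 99.99%.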

Fig. 2.
Validation of large deletions by GS-FLX reads. An example of a 634-bp deletion region in G. soja validated by GS-FLX reads (Gm02, 10126582~10127216). The first and second tracks indicate reads mapped to sequences flanking the deleted region from ...

De Novo Assembly.

We identified 32.4 Mb of G. max-specific sequences (absent in G. soja) by large deletion detection. These results indicated that the G. soja consensus sequence covered most of the reference Glyma1.01, ignoring these unmappable regions. In addition, 32,262 contigs spanning 8.2 Mb were assembled de novo (SI Appendix, SI Text and Table S10). Among them, 524 contigs (31~3,148 bp) were incorporated into the G. soja consensus (SI Appendix, Fig. S6 and Table S11), where 102 putative genes were found by searching the GenBank protein database (SI Appendix, Table S12). Of 38 successfully amplified and dideoxy-sequenced contigs, 25 were found to be G. soja-specific and 13 to be G. max-unsequenced gaps (SI Appendix, Table S13).

Effect of Mapping Depth and Chromosomal Distribution of SNPs/Indels.

The chromosomal distribution of the identified SNPs and indels between G. max and G. soja was nonuniform because fewer SNPs and indels were predicted in pericentromeric regions, which are highly repetitive in soybean (Fig. 3A; SI Appendix, Fig. S7); this pattern was inconsistent with previous studies (14, 17). Nucleotide variants for which the mapping depth exceeded 200 were filtered out in the pericentromeric regions to raise the quality of variation detection. Accordingly, DNA variant discovery using MPS was confined to less repetitive regions. Because pericentromeric regions had little genic content, however, precise detection of the variation in functional analyses using MPS technology may be unnecessary. Genome coverage increased with the mapping depth of short DNA reads until it reached a plateau (Fig. 3B). For a read threshold ≥1, the genome coverage reached a plateau of 98% at a mapping depth of 20-fold, whereas read thresholds of ≥3 and ≥5 showed genome coverage plateaus at a mapping depth of 30-fold. Most of the SNPs were identified at a mapping depth of 30-fold for read thresholds ≥3 and ≥5, but SNP discovery increased gradually up to a mapping depth of 42-fold (Fig. 3B). These results indicated the theoretical maximum of genome coverage and SNP calling at the effective mapping depth reached in this G. soja genome sequencing using MPS technology. However, multisite mapping of short reads resulting from genome duplications might overestimate SNP calls in paralogous regions. It should be noted that the identification of unique and accurate SNPs in G. soja required a much higher read threshold than in less duplicated genomes (14, 16–20).

Fig. 3.
Distribution of sequence variation on Chr 1 of G. soja. (A) Black and red lines indicate total SNPs and number of indels, respectively. Gene numbers and numbers of nonsynonymous SNPs on Chr 1 are shown as corresponding colored bars. The gray area represents ...

Genomic Difference and Divergence Between G. max and G. soja.

Of the 937.5-Mb genome sequences examined, 0.31% differed between G. max and G. soja. This percentage increased to 3.76% when large deleted sequences were also considered (Table 1). In addition, the theoretical divergence time was estimated between the genomes of IT182932 (G. soja) and Williams 82 (G. max) by calculating genetic divergence and showed that G. soja and G. max diverged at 0.267 ± 0.03 Mya assuming a distribution of synonymous substitution patterns (Ks) with 6,780 synonymously changed neutral genes (SI Appendix, Fig. S8). Although a divergence time based on the nucleotide sequences of only two genotypes could be an overestimate, these results suggest that the divergence between IT182932 and Williams 82 predated soybean domestication. G. max is essentially a domesticated form of G. soja. Thus, our data suggest that the G. soja/G. max complex is at least 270,000 y old (Fig. 4). It is widely accepted that there would be no undomesticated G. max without domestication; however, given that the domestication of soybean likely occurred 6,000~9,000 y ago (1, 30), genetic divergence clearly predates domestication. In rice, molecular phylogenetic studies have also suggested that the genomes of cultivated O. sativa L. ssp. indica and O. sativa ssp. japonica were derived from those of independent wild populations that had already diverged before domestication at 0.2–0.4 Mya (31–33). Thus, genome comparison suggests that the genetic history of soybean is more complicated than previously assumed and, to determine the origin of domesticated G. max, additional studies are needed.

Fig. 4.
Soybean domestication history. G. max is generally believed to have been domesticated from its wild relative, G. soja, 6,000~9,000 y ago. The G. soja/G. max complex diverged from a common ancestor at 0.27 Mya. Divergence between G. soja and G. ...


The Second Legume Genome Sequence.

Among about 20,000 species in the Leguminosae, only two model legumes, Medicago truncatula (http://www.medicago.org/genome) and Lotus japonicus (http://www.kazusa.or.jp/lotus), with small numbers of chromosomes are being sequenced. The soybean genome sequence released recently was the first legume sequence using an elite cultivar, Williams 82 (11). For the whole-genome sequencing of wild soybean (G. soja), a resequencing strategy was used to take advantage of the genome sequence (Glyma1.01) of the nearest sequenced relative, G. max. Previous resequencing projects, including those of Homo sapiens (15–19) and A. thaliana (20), used the same species as a reference. G. max and G. soja are morphologically quite different, but they are close genomically, as revealed by the mapping of more than 90% of G. soja reads to Glyma1.01 (SI Appendix, Table S1). Such similarity between species differs from the interspecific variation observed in rice. Rice is reported to have experienced dynamic genome evolution among 10 Oryza species (34). Even though Oryza has two reference genome sequences, African cultivated rice (O. glaberrima) is being sequenced through de novo next-generation sequencing instead of resequencing (35) because it is distinct from the sequenced O. sativa. Although it is arguable whether G. soja is a completely different species from G. max, it was possible to resequence the G. soja genome using the G. max genome as a reference. The 915.4-Mb genomic consensus sequence of wild soybean was produced, covering 97.65% of the G. max genome.

The 0.31% Genomic Difference Between Wild and Cultivated Soybeans.

The SNPs and indels in precisely aligned areas differed by 0.31% between G. max and G. soja. This difference is less than differences between Arabidopsis accessions (20) and between O. sativa ssp. indica and O. sativa ssp. japonica (6). Although G. soja and G. max were classified as distinct species before their genomic sequences were determined, the sequences of the G. soja and G. max genomes presented here prove that these species are close relatives. The whole-genomic difference between wild and cultivated soybeans is similar to that between human and chimpanzee genomes, which has been reported to be ~4% (36). However, humans and chimpanzees have a single-nucleotide sequence difference of ≈1% (35 million SNPs in 2.4 Gb) (36). It was found that G. soja and G. max had a single-nucleotide difference of 0.31% and the portion of genomic structural variation resulting from deletion events in G. soja was relatively high (3.45%). About 21% of the regions present in G. max but absent in G. soja contained transposable elements (TEs). These results are similar to those of a previous study showing that the genomes of domesticated species can expand with the dramatic amplification of TEs (37).

Validation of MAQ Predictions by GS-FLX Sequencing.

GS-FLX sequences were used to validate MAQ predictions of SNPs, indels, and structural deletion variations based on Illumina-GA read alignments to the G. max reference sequence. A concordance of 99.99% was confirmed using a dataset of 0.3 million SNPs common to both Illumina-GA and GS-FLX platforms. We also validated 3,336 small indels and 2,187 large deletions using the GS-FLX long reads. In human resequencing studies, the identified SNPs were confirmed using a high-throughput SNP beadarray (17–19) and Applied Biosystems Sanger sequencing was used to validate the limited number of indels (17). To confirm structural variation, BAC or fosmid sequencing and comparative genomic hybridization array analysis were used (16, 17). Because the G. soja genome has not previously been characterized, few genomic resources, such as BAC libraries or commercial SNP chips, are available. Accordingly, GS-FLX sequencing was used to validate the Illumina-GA results, and this approach was successful. Thus, to sequence nonmodel organisms using short-read sequencing technologies, GS-FLX sequencing can be used for genome-wide validation of the identified DNA variants.

Putatively Functional Variants Between G. max and G. soja.

Sequencing of the G. soja genome generated a large dataset of sequence variants between G. soja and G. max. Notably, 1,945 nonsynonymous SNPs identified caused start or stop codon changes of 1,791 genes, which are expected to have drastic effects on their transcripts. Because these could be direct and rich resources for genetic studies of phenotypic differences between two species, this information is being applied to define genes linked with phenotypic variation using the constructed genetic map of Hwangkeumkong (G. max) x IT182932 (G. soja) (38).

Nucleotide variation in the regions 1 kb upstream of transcription start sites can affect transcription factor binding. More than 100,000 nucleotide variations in thousands of genes were identified between G. max and G. soja (Table 1). Nucleotide polymorphisms in promoter regions can alter gene function without changing existing gene structure. Thus, these variations are important parameters in the study of complex phenotypic variation such as the soybean domestication syndrome, which includes loss of seed shattering by an SNP 12 kb upstream of the qSH1 gene (37).

Complete or partial gain and loss of various genes as a result of large structural variations were also observed between G. soja and G. max. The 32.4 Mb of G. max-specific sequences in 712 genes were partially or completely absent in G. soja by 555 deletion events (Table 1; SI Appendix, Fig. S4). This suggests that cultivated soybean acquired a considerable number of genes absent in its wild relative during domestication and improvement, or these genes might be lost during diversification of G. soja. Besides coding sequences, LTR retrotransposons (present in 17.28% of the deleted regions in G. soja) were the most predominant retrotransposon type (SI Appendix, Table S7 and Fig. S3). It has been shown that the genomes of domesticated species can be expanded by dramatic amplification of TEs (39). Furthermore, retrotransposon insertions have been reported to collapse or silence genes, as exemplified by phenotypic changes in tomato fruit shape (40). Thus, the frequent LTR insertions into the G. max genome can facilitate understanding of soybean domestication.

Genome Duplication in Soybean.

Soybean shows vestiges of ancient polyploidy (paleopolyploidy) events in its genome. An allopolyploid event shared with the Medicago lineage, originating from the hybridization of two different ancestral legumes (2n = 20) at 59 Mya, was followed by another independent whole-genome duplication at 14 Mya (30). Tetrad homeologous regions are expected to exist in soybean genomes (41, 42), but divergence among these homeologous regions accumulated over time, resulting in triplet or paired homeologous regions. In particular, most duplicated genes generated by duplication events have high sequence homologies.

Mapping of short reads to the G. max reference genome revealed that sequence similarity between homeologous regions led to much higher coverage in these regions. Read coverage has been used as a criterion for region-specific copy-number measurements in C. elegans and Arabidopsis (14, 20). The ratio of observed to expected coverage was surveyed in 100-bp windows and the genomic regions present at a copy number of 1–10 were identified. Genomic regions present as a single copy comprise only 18.36% of the G. soja genome, indicating that the remaining 81.64% of the genome is duplicated (Fig. 1).
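The window-based survey described above can be sketched as follows: average the per-base depth in each 100-bp window, divide by the expected depth, and take the nearest integer as the copy-number estimate. This is an illustrative simplification under stated assumptions; the rounding rule, the cap at 10 copies, and the function name are choices made here, and the GC-content correction mentioned in the Results is omitted.

```python
def copy_number_by_window(depths, expected_depth, window=100, max_copies=10):
    """Estimate copy number per window from per-base read depth.

    depths: per-base mapping depth along one chromosome.
    expected_depth: depth expected for a single-copy region.
    """
    if expected_depth <= 0:
        raise ValueError("expected_depth must be positive")
    copies = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        ratio = (sum(chunk) / len(chunk)) / expected_depth
        # Nearest integer, clamped to the range 0..max_copies.
        copies.append(min(max_copies, max(0, round(ratio))))
    return copies


# Toy depth profile: three 100-bp windows at roughly 1x, 2x, and 3x of a 43-fold baseline.
toy_depths = [43] * 100 + [86] * 100 + [130] * 100
print(copy_number_by_window(toy_depths, expected_depth=43))
```

Windows with an estimate of 1 correspond to the single-copy fraction reported above; estimates of 2 or more mark candidate duplicated regions.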

These paramorphisms in duplicated regions prevented the assignment of accurate bases in the consensus and the identification of unique SNPs. To alleviate these problems arising from genome duplication, we estimated the mapping depth required in Illumina-GA short-read sequencing for maximum genome coverage and SNP calls. At an average mapping depth of more than 20-fold, the G. soja consensus covered 98% of the G. max reference genome (Fig. 3B). However, when the read threshold was set to 5 or higher, the number of identifiable SNPs dropped and a mapping depth over 30-fold was required (Fig. 3B). Because polyploidy is highly pervasive in crop plants, a mapping depth >30 is likely to be required for resequencing using MPS technology in other crops.
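The relationship between mapping depth, read threshold, and genome coverage can be explored with an idealized model in which the number of reads covering a base follows a Poisson distribution. This model is an assumption introduced here for illustration and was not used in the study; it ignores repeats, duplications, and GC bias, so it predicts higher coverage than the 98% plateau observed.

```python
import math


def fraction_covered(mean_depth: float, min_reads: int = 1) -> float:
    """Expected fraction of bases covered by at least min_reads reads,
    assuming Poisson-distributed per-base depth with the given mean."""
    p_below = sum(
        math.exp(-mean_depth) * mean_depth ** i / math.factorial(i)
        for i in range(min_reads)
    )
    return 1.0 - p_below


# Compare the read thresholds discussed in the text at several depths.
for depth in (5, 10, 20, 30):
    row = [f"{fraction_covered(depth, k):.4f}" for k in (1, 3, 5)]
    print(depth, row)
```

Even in this idealized setting, a stricter read threshold requires a higher depth to reach the same coverage, which is the qualitative pattern reported in Fig. 3B.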

Soybean Domestication.

Soybean is generally considered to have been domesticated from its wild relative (G. soja) 6,000~9,000 y ago in China (1). The exact region of origin of soybean is still unknown, but southern China, the Yellow River valley of central China, northeastern China, and several other regions are all candidate sources (1) because G. soja grows naturally in far eastern Russia, China, Korea, and Japan (1). Few genetic studies have examined soybean domestication and the phenotypic differences between domesticated soybean and its wild progenitor. Wild soybean has been considered simply as a breeding parent with useful traits, such as disease and pest resistance. Using breeding populations derived from crosses between wild and cultivated soybeans, several quantitative trait loci responsible for domestication-related traits have been reported (2). However, no genes coding for domestication-related traits have been isolated, even though great effort has been made to understand which genetic elements contributed to domestication and how genetic divergence of wild and cultivated species occurred. Numerous methods of measuring genetic divergence are available, but the number of synonymous substitutions between orthologous genes was used to estimate the neutral genetic drift distance between these two species because the genomes of wild and domesticated soybean are similar. The synonymous distances of 6,780 genes were distributed from Ks values of 0–0.099 (SI Appendix, Fig. S8). Using a peak Ks value of 0.003, the divergence between G. soja and G. max was estimated to have occurred at 0.267 ± 0.03 Mya. However, the archaeologically established domestication time of soybean is much less than 270,000 y. There are many possible reasons for this discrepancy. First, we compared only two single genomes rather than two or more genome populations from each species. It is possible that the G. max genome contains genetic material from multiple species involved in the domestication process. This could result in the relatively large genetic distance observed between G. max and known wild species of soybean. Second, the G. soja used in our study may have diverged for over 270,000 y from G. max, which has spread widely since it was first cultivated 6,000~9,000 y ago (Fig. 4). In this case, there may be an undomesticated population of G. max somewhere in East Asia that is the direct ancestor of the domesticated one. Third, the selected marker set may not be reliable in terms of the number and type of markers. To select a good set of haplotype blocks, it is usually necessary to perform extensive SNP block variation studies and/or sequence multiple genomes. Fourth, although an attempt to remove theoretically naturally selected mutations was made by using only neutral genes, the human selection factor still remains, leading to an exaggerated divergence time. Regardless of which explanation accounts for the discrepancy, it is necessary to sequence more soybean genomes and to compare these genomes to better understand the origins of domesticated soybean and the processes underlying domestication.
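A divergence time is commonly derived from a peak Ks value with the relation T = Ks / (2r), where r is the synonymous substitution rate per site per year and the factor of 2 accounts for substitutions accumulating along both lineages. The sketch below applies that relation; the rate used is a placeholder assumption, because the rate applied in this study is not stated in the text, so the output is not expected to reproduce the reported 0.267 Mya.

```python
def divergence_time_years(ks: float, subs_per_site_per_year: float) -> float:
    """Divergence time from synonymous distance: T = Ks / (2 * r)."""
    if subs_per_site_per_year <= 0:
        raise ValueError("substitution rate must be positive")
    return ks / (2.0 * subs_per_site_per_year)


# ASSUMPTION: illustrative rate only; not taken from the study.
ASSUMED_RATE = 6.1e-9  # synonymous substitutions per site per year

peak_ks = 0.003  # peak Ks value reported in the text
t_years = divergence_time_years(peak_ks, ASSUMED_RATE)
print(f"Estimated divergence: {t_years / 1e6:.3f} Mya")
```

Because T scales inversely with r, the estimate is sensitive to the chosen rate, which is one reason a date based on two genotypes should be read with caution.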

In summary, using a combination of two next-generation sequencing strategies, we demonstrate the feasibility of aligning and assembling the G. soja genome to the G. max reference genome. Also, SNPs, indels, and large structural variations detected in this study represent a comprehensive list of genetic changes, some of which may underlie the domestication of G. soja. Our results suggest that the G. soja/G. max complex may have been present at 0.27 Mya and the divergence of G. soja and G. max predated domestication. This genome-wide comparison of domesticated and undomesticated forms allows systematic cataloging of the whole repertoire of genetic variants. Our approach with next-generation sequencing platforms and the reference sequence from related species could aid in the development of structural and functional genomic research in soybeans and other legumes.

Experimental Procedures

Plant Material.

Wild soybean IT182932 (G. soja) was collected from a wild field in Yong-In City, Gyeonggi Province, South Korea, in 1994. After multiple self-pollination events aimed at reducing heterozygous loci, microsatellite analysis of several IT182932 plants was used to select homozygous individuals to serve as mapping parents for the construction of a genetic linkage map of Hwangkeumkong (G. max) x IT182932 (42).

G. soja Sequencing and Analysis.

The wild soybean (G. soja var. IT182932) genome was studied using two MPS technologies: Illumina-GA for single- and paired-end sequencing and GS-FLX for pyrosequencing. Sequences were aligned with default parameters using the MAQ program (ver. 0.7). After the short reads were aligned against the G. max genome (Glyma1.01), a consensus sequence was used for SNP calling with options. MAQ was also used to detect short indels. Structural variations were detected when a paired-end read had an anomalously long span size with a minimum of three reads. GS Reference Mapper software (ver. 2.0) with default parameters was used for alignment of GS-FLX reads. The predicted deletion regions were validated with GS-FLX reads using the BLAT program (43). Repetitive elements in deletion regions were identified by RepeatMasker software (http://repeatmasker.org), and de novo assembly of unmapped and unpaired reads was accomplished with the Velvet program (ver. 0.7.31) (44). Further details can be found in SI Appendix, SI Experimental Procedures.
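The paired-end criterion above, an anomalously long span supported by at least three read pairs, can be sketched as follows. Only the minimum of three supporting pairs comes from the text; the span cutoff (mean plus three standard deviations), the overlap-based clustering, and all names are assumptions made for illustration and do not describe MAQ's internal procedure.

```python
def candidate_deletions(pairs, mean_insert, sd_insert, n_sd=3.0, min_support=3):
    """Report candidate deletions from read pairs on one chromosome.

    pairs: list of (left_start, right_end) mapping coordinates per read pair.
    Returns a list of (inner_start, inner_end, supporting_pairs).
    """
    cutoff = mean_insert + n_sd * sd_insert  # assumed definition of "anomalous"
    anomalous = sorted(p for p in pairs if p[1] - p[0] > cutoff)

    clusters, current = [], []
    for left, right in anomalous:
        # Start a new cluster when this pair no longer overlaps the current one.
        if current and left > min(r for _, r in current):
            clusters.append(current)
            current = []
        current.append((left, right))
    if current:
        clusters.append(current)

    return [
        (max(l for l, _ in c), min(r for _, r in c), len(c))
        for c in clusters
        if len(c) >= min_support
    ]


# Toy data: three pairs spanning the same ~1-kb gap, one isolated long pair,
# and two pairs with normal spans.
toy_pairs = [(100, 300), (1000, 2000), (1010, 2015), (1020, 1990),
             (5000, 6200), (5100, 5290)]
print(candidate_deletions(toy_pairs, mean_insert=200, sd_insert=20))
```

The isolated long pair is discarded because it lacks the required support, which illustrates why a minimum read count reduces false calls from chimeric or mismapped pairs.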


Acknowledgments

We thank the US Department of Energy-Joint Genome Institute for providing the annotated gene list, which was recently updated. This work was supported primarily by the Korea Rural Development Administration, BioGreen 21 Project 20080401034010 to S.-H.L. (International Atomic Energy Agency 12613/R5 to S.-H.L.), and the Korea Research Institute of Bioscience and Biotechnology Research Initiative Program funded by the Ministry of Education, Science, and Technology, the Republic of Korea (to J.B.).


See Commentary on page 21947.

*This Direct Submission article had a prearranged editor.

The authors declare no conflict of interest.

Database deposition: The sequence data from this study have been deposited in the National Center for Biotechnology Information Short Read Archive, www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi (accession no. SRA009252).

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1009526107/-/DCSupplemental.


References

1. Carter TE, Jr., Nelson R, Sneller CH, Cui Z. Genetic diversity in soybean. In: Boerma HR, Specht JE, editors. Soybeans: Improvement, Production and Uses. Madison, WI: Am Soc Agron; 2004. pp. 303–416.
2. Liu B, et al. QTL mapping of domestication-related traits in soybean (Glycine max). Ann Bot (Lond). 2007;100:1027–1038.
3. Zhang WK, et al. QTL mapping of ten agronomic traits on the soybean (Glycine max L. Merr.) genetic map and their association with EST markers. Theor Appl Genet. 2004;108:1131–1139.
4. Kang S-T, et al. Population-specific QTLs and their different epistatic interactions for pod dehiscence in soybean (Glycine max (L.) Merr.). Euphytica. 2009;166:15–24.
5. Sasaki T, Antonio BA. Plant genomics: Sorghum in sequence. Nature. 2009;457:547–548.
6. International Rice Genome Sequencing Project. The map-based sequence of the rice genome. Nature. 2005;436:793–800.
7. Jaillon O, et al. French-Italian Public Consortium for Grapevine Genome Characterization. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature. 2007;449:463–467.
8. Paterson AH, et al. The Sorghum bicolor genome and the diversification of grasses. Nature. 2009;457:551–556.
9. Huang S, et al. The genome of the cucumber, Cucumis sativus L. Nat Genet. 2009;41:1275–1281.
10. Schnable PS, et al. The B73 maize genome: Complexity, diversity, and dynamics. Science. 2009;326:1112–1115.
11. Schmutz J, et al. Genome sequence of the palaeopolyploid soybean. Nature. 2010;463:178–183.
12. Mardis ER. Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet. 2008;9:387–402.
13. Shendure J, Ji H. Next-generation DNA sequencing. Nat Biotechnol. 2008;26:1135–1145.
14. Hillier LW, et al. Whole-genome sequencing and variant discovery in C. elegans. Nat Methods. 2008;5:183–188.
15. Wheeler DA, et al. The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008;452:872–876.
16. Bentley DR, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456:53–59.
17. Kim J-I, et al. A highly annotated whole-genome sequence of a Korean individual. Nature. 2009;460:1011–1015.
18. Ahn S-M, et al. The first Korean genome sequence and analysis: Full genome sequencing for a socio-ethnic group. Genome Res. 2009;19:1622–1629.
19. Wang J, et al. The diploid genome sequence of an Asian individual. Nature. 2008;456:60–65.
20. Ossowski S, et al. Sequencing of natural strains of Arabidopsis thaliana with short reads. Genome Res. 2008;18:2024–2033.
21. The Drosophila Genetic Reference Panel (DGRP) http://service004.hpc.ncsu.edu/mackay/Good_Mackay_site/DBRP.html.
22. Kaiser J. DNA sequencing: A plan to capture human diversity in 1000 genomes. Science. 2008;319:395.
23. Weigel D, Mott R. The 1001 genomes project for Arabidopsis thaliana. Genome Biol. 2009;10:107.
24. Zhu YL, et al. Single-nucleotide polymorphisms in soybean. Genetics. 2003;163:1123–1134.
25. Van K, et al. Discovery of single nucleotide polymorphisms in soybean using primers designed from ESTs. Euphytica. 2004;139:147–157.
26. Choi I-Y, et al. A soybean transcript map: Gene distribution, haplotype and single-nucleotide polymorphism analysis. Genetics. 2007;176:685–696.
27. Ramensky V, Bork P, Sunyaev S. Human non-synonymous SNPs: Server and survey. Nucleic Acids Res. 2002;30:3894–3900.
28. Ng PC, Henikoff S. SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003;31:3812–3814.
29. Findley SD, et al. A fluorescence in situ hybridization system for karyotyping soybean. Genetics. 2010;185:727–744.
30. Gill N, et al. Molecular and chromosomal evidence for allopolyploidy in soybean. Plant Physiol. 2009;151:1167–1174.
31. Sang T, Ge S. Genetics and phylogenetics of rice domestication. Curr Opin Genet Dev. 2007;17:533–538.
32. Zhu Q, Ge S. Phylogenetic relationships among A-genome species of the genus Oryza revealed by intron sequences of four nuclear genes. New Phytol. 2005;167:249–265.
33. Ma J, Bennetzen JL. Rapid recent growth and divergence of rice nuclear genomes. Proc Natl Acad Sci USA. 2004;101:12404–12410.
34. Buell CR. Poaceae genomes: Going from unattainable to becoming a model clade for comparative plant genomics. Plant Physiol. 2009;149:111–116.
35. Rounsley S, et al. De novo next generation sequencing of plant genomes. Rice. 2009;2:35–43.
36. Varki A, Nelson DL. Genomic comparisons of humans and chimpanzees. Annu Rev Anthropol. 2007;36:191–209.
37. Konishi S, et al. An SNP caused loss of seed shattering during rice domestication. Science. 2006;312:1392–1396.
38. Yang K, et al. Genome structure in soybean revealed by a genomewide genetic map constructed from a single population. Genomics. 2008;92:52–59.
39. Naito K, et al. Dramatic amplification of a rice transposable element during recent domestication. Proc Natl Acad Sci USA. 2006;103:17620–17625.
40. Xiao H, Jiang N, Schaffner E, Stockinger EJ, van der Knaap E. A retrotransposon-mediated gene duplication underlies morphological variation of tomato fruit. Science. 2008;319:1527–1530.
41. Shin JH, et al. The lipoxygenase gene family: A genomic fossil of shared polyploidy between Glycine max and Medicago truncatula. BMC Plant Biol. 2008;8:133.
42. Kim KD, Shin JH, Van K, Kim DH, Lee S-H. Dynamic rearrangements determine genome organization and useful traits in soybean. Plant Physiol. 2009;151:1066–1076.
43. Kent WJ. BLAT: The BLAST-like alignment tool. Genome Res. 2002;12:656–664.
44. Zerbino DR, Birney E. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18:821–829.

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences