The Glycine max cv. Enrei Genome for Improvement of Japanese Soybean Cultivars

We elucidated the genome sequence of Glycine max cv. Enrei to provide a reference for characterization of Japanese domestic soybean cultivars. The whole genome sequence obtained using a next-generation sequencer was used for reference mapping into the current genome assembly of G. max cv. Williams 82 obtained by the Soybean Genome Sequencing Consortium in the USA. After sequencing and assembling the whole genome shotgun reads, we obtained a data set with about 928 Mbs total bases and 60,838 gene models. Phylogenetic analysis provided glimpses into the ancestral relationships of both cultivars and their divergence from the complex that include the wild relatives of soybean. The gene models were analyzed in relation to traits associated with anthocyanin and flavonoid biosynthesis and an overall profile of the proteome. The sequence data are made available in DAIZUbase in order to provide a comprehensive informatics resource for comparative genomics of a wide range of soybean cultivars in Japan and a reference tool for improvement of soybean cultivars worldwide.


Introduction
Soybean (Glycine max) is one of the world's most important leguminous crops being a major source of edible proteins and vegetable oils. In terms of global production, soybean ranks fourth, following the major cereal crops such as rice, wheat, and corn. It is also a major source of nutritionally and physiologically active substances such as saponins, isoflavones, phytosterols, and tocopherols. Consumption of soybeans as food is largely concentrated in Asia. Soybean has been a part of the Japanese diet and eaten from ancient times as a valuable source of traditional fermented products such as miso, soy sauce, and natto and nonfermented products such as edamame (boiled soybean), kinako (toasted soybean flour), tofu, and soymilk. As in other major crops, the main targets of soybean breeding in Japan are high yield, high quality (absence of seed coat cracking, seed size, hilum color, uniformity of seed size, and food processing adaptability) to compete with imported soybean, and resistance to biotic/abiotic stress for stable production. Additionally, the chemical component of seeds including high protein content, modification of storage proteins, absence of lipoxygenases and saponin, high isoflavone content, and high sucrose have been given much consideration in many soybean breeding programs [1].
The domesticated soybean has its origin from Glycine soja, a wild soybean species found mainly in northern China, Japan, Korea, and the eastern part of Russia [2]. Archaeological studies indicate that the word for soybean first appeared in China about 3,700 years ago in bone inscriptions dating back to the Yin and Shang dynasties and carbonized soybean 2 International Journal of Genomics seeds found about 2,600 years ago [2]. Estimation of archaeological records indicates a widespread early association of small seeded soybean to be as old as 9,000-8,600 calibrated years before the present (cal BP) in northern China and 7,000 cal BP in Japan [3]. Direct radiocarbon dates on charred soybean seeds indicate selection resulted in large seed sizes in Japan by 5,000 cal BP (Middle Jomon) and in Korea by 3,000 cal BP (Early Mumun) [3]. Extensive genome analysis also indicates that the G. soja/G. max complex diverged from the most recent common ancestor at 0.27 Mya [4] or 0.8 Mya [5]. In a more recent study, the genetic variation and population structure among 1,603 soybean accessions indicated a clear genetic differentiation among Japanese soybean landraces, exotic and cultivated soybeans, and wild soybeans [6].
From the genomics point of view, soybean has been used as a model plant for comparative studies of legumes in terms of root nodulation, oilseed production, and secondary metabolism. It is also a valuable material for genome research because of the availability of many genomic and germplasm resources. In 2010 a great deal of effort in the USA culminated with the sequencing of the paleopolyploid soybean genome based on a soybean cultivar Williams 82 [7]. This cultivar was derived from backcrossing a Phytophthora root rot resistance locus from the donor parent Kingwa which was selected in 1921 from the cultivar Peking introduced from Beijing, China, in 1906 [8].
In Japan, however, domestic cultivars have been developed to suit a variety of conditions and applications of specific importance to Japanese growers. Although the G. max cv. Williams 82 reference soybean genome sequence could be useful in understanding the diversity among many cultivars, it is necessary to have genomic resources that could be directly applied to Japanese soybean cultivation. The Japanese soybean cultivar Enrei was derived from cultivars Norin number 2 and Higashiyama number 6 (also known as cv. Shiromeyutaka) and was developed in 1971 at Kikyogahara Branch of the Nagano Agricultural Experiment Station (presently known as Nagano Vegetable and Ornamental Crops Experiment Station) [9]. In this paper, we described the analysis of the genome sequence of the Japanese soybean cultivar Enrei focusing on the phylogenetic analysis and major traits for soybean breeding including anthocyanin and flavonoid biosynthesis and proteome profile.

Genome
Sequencing. The plant material was provided by the Genebank of the National Institute of Agrobiological Sciences (NIAS). High-quality nuclear DNA with reduced organellar DNA was extracted from young leaves using a protocol designed for BAC DNA extraction with some modifications [10]. All sequencing reads were obtained using the Illumina HiSeq2000 at Operon Biotechnologies, Inc. (Eurofins Genomics). Standard short-read libraries and mate-paired libraries with 8 kbp insertion were built using the TruSeq SBS v5 for sequencing runs at 2 × 100 bp or 200 bp total. After sequencing, HiSeq Control Software v.1.4.8 and CASAVA 1.8.1 (Illumina) were utilized for base calling. Single-ended libraries and 3 kbp pair-ended libraries constructed with the GS FLX Titanium General Library Preparation Kit and Rapid Library Preparation Kit (Roche) were sequenced on Roche 454 FLX Titanium at the NIAS, and base calling was performed using the 454 FLX Titanium base caller.

Assembly and Reference
Mapping. We constructed a de novo genome assembly (G. max Enrei1) and reference genome assembly (G. max Enrei2) to facilitate comprehensive analysis of the genome. The G. max Enrei1 assembly was constructed from the Roche 454 FLX Titanium single-ended reads and pair-ended reads with 3 kbp insert, the Illumina HiSeq2000 pair-ended reads with 300 bp insert and matepair reads with 8 kbp insert, and the ABI 3730xl BAC-end reads using the Roche Newbler 2.7.
The G. max Enrei2 assembly was derived from Roche 454 FLX Titanium single-ended reads and Illumina HiSeq2000 pair-ended reads were used for reference mapping with the BWA 0.7.5a (Li H. Aligning sequence reads, clone sequences, and assembly contigs with BWA-MEM, 2013; http://bio-bwa .sourceforge.net/), SAMtools 0.1.19 [11], and NIG script (NGS Surfer's wiki, http://cell-innovation.nig.ac.jp/wiki/tiki-index .php?page=samtools#mpileup ). The G. max cv. Williams 82, also referred to as Gmax275 genome assembly, was used for reference mapping. The pseudomolecules and scaffolds in the G. max Enrei2 were searched for marker sequences by BLASTn (NCBI BLAST, ftp://ftp.ncbi.nih.gov/blast/). Then marker sequences were mapped in the regions with clear sequences, gap regions (indicated as N's in the sequence), BAC-end sequences (BES) hit position, marker hit position, and scaffold derived from de novo assembly hit position. Subsequently, the cutting points were identified to reconstruct the pseudomolecules and scaffolds.

Gene Models.
The soybean parameter files were built from chromosome 16 region of hard masked Gmax275 genome [30,000,000-37,887,014 bps] using Augustus program [12]. The transposable elements in the scaffolds and pseudomolecules of the G. max Enrei2 genome assembly were masked using RepeatMasker (Smit AFA, Hubley R., and Green P. RepeatMasker Open-3.0, 1996-2010, http://www .repeatmasker.org/) and the gene models were built using Augustus. These gene models were used as queries in BLASTn search using the soyTE as a database [13]. The filtered gene models with bit score of 100 and above were selected. Additionally, we used available RNAseq data (PRJDB3582) assembled by Trinity version 2014-07-17 [14]. A total of 172,753 gene models were extracted. For each gene, the longest ORF was identified using EMBOSS getorf [15].

Anthocyanin and Flavonoid Biosynthesis.
All gene models in Gmax275 and G. max Enrei2 associated with anthocyanin and flavonoid biosynthesis were extracted and clustered using OrthoMCL [19]. Then these gene models were associated by BLASTn.
2.6. Proteome Analysis. The proteome analysis of Enrei cultivar was performed using seeds. The cotyledons from ten seeds were grounded in liquid nitrogen and purified by phase separation using standard procedures [24]. The purified proteins were digested with trypsin. For mass spectrometry analysis, the eluted peptides were analyzed on a nanospray LTQ XL Orbitrap mass spectrometer and the MS spectra were used for protein identification. Identification of proteins was performed using the Mascot search engine version 2.4.1 (Matrix Science, London, UK) and Proteome Discoverer software version 1.4.0.288 (Thermo Fisher Scientific) against 54,175 soybean peptide sequences [7]. Mascot results were filtered with Mascot Percolator software to improve the accuracy and sensitivity of the peptide identification [25]. The protein abundance was analyzed using emPAI value as described in Shinoda et al. [26]. Furthermore, the protein gene models derived from Gmax275 and G. max Enrei2 genome assemblies were associated using the clustered data obtained from OrthoMCL [19]. Additionally, these gene models were associated by BLASTn.  Table S6). In total, 11 gene models had no ORF hit sequences, 20,542 gene models had more than 50% coverage, 5,950 gene models had more than 90% coverage, and 2,269 gene models had 100% coverage. As mean coding sequence length was 1,168.1 bps, mean number of exons per gene 5.0, and mean exon length 231.5 bps of 56,044 Gmax275 gene models without variant, Enrei number of exons per gene was shorter and CDS was longer. The difference in the gene models between Gmax275 and G. max Enrei2 may be attributed to SNPs between the two cultivars as well as several parameters used in building the gene models.

Phylogenetic Analysis.
We applied OrthoMCL [19] to the clustered and aligned gene models of A. thaliana [16], A. lyrata [17], G. max cv. Williams 82 [7], G. max cv. Enrei, M. truncatula [18], and O. sativa (annotation data on Os-Nipponbare-Reference-IRGSP-1.0.). A set of filtered single copy genes was selected to calculate the phylogenetic relationships and divergence time among these species (Supplementary Table S1). Based on the phylogenetic divergence of  [28]. A whole genome duplication (WGD) which occurred around 58 Mya had been a major factor in shaping the M. truncatula genome [18].
The complex of G. max and its wild relative, G. soja, diverged from the most recent common ancestor around 0.27 Mya [4] or 0.8 Mya [5]. Assuming that G. max diverged from G. soja at around 0.8 Mya, the divergence of the branch for both Williams 82 and Enrei at around 0.34 Mya was much later than previously estimated. As Li et al. [5] pointed out, divergent selection may have contributed to the differentiation of G. soja and G. max before domestication of G. max. The divergent selection as adaptation to different environments must have contributed to the differentiation of both cv. Williams 82 and cv. Enrei from the most recent common ancestor.
3.6. Protein. Using gene models associated with seed proteome data, a total of 164 protein gene models corresponding to storage proteins, lipid synthesis/degradation enzymes, sorting/folding-related proteins, late embryogenesis abundant (LEA) protein, glycolysis pathway enzymes, protease/ protease inhibitors, and others were identified (Supplementary Table S8). The protein content of dry seeds was 35-42% [34,35] of the dry weight, and 70% of protein consists of 7S and 11S globulins [34,36], which are part of the cupin superfamily (http://www.ebi.ac.uk/interpro/entry/IPR006045), corresponding to beta-conglycinin and glycinin, respectively [37]. To identify the proteins associated with grain filling of soybean seeds, we conducted a proteome analysis of the cotyledon. A total of 160 protein gene models in Gmax189 correspond to Enrei protein gene models ranging from 7.87 mol% to 0.03 mol% (Supplementary Table S8). Most of these proteins are storage proteins and cupin including beta-conglycinin and glycinin representing about 42% of total mol% and about 55% of total weight (sum of mass * mol) ( Table 2). Genes controlling the content of seed storage proteins were also highly represented [38,39]. Genes associated with lipid metabolism such as lipoxygenase 1, peroxygenase 2, and oleosin family protein genes [40]; gene associated with sorting/folding-related protein such as HSP20-like chaperone, PDI-like, SNF7 family, and vacuolar sorting receptor proteins; and LEA protein genes which may be important in protecting other proteins from aggregations were highly represented in the Enrei genome. In addition, some genes involved in glycolysis pathway, enzymes, and proteinase/protease inhibitors, which may play an important role in germination stage, were also found. This proteome profile may provide the basis for understanding cultivar diversity and adaptation to cultivation condition.

Enrei Genome Database.
All sequencing data can be accessed in DAIZUbase (http://daizu.dna.affrc.go.jp/enrei/), an informatics resource for soybean genomics focusing on the Japanese soybean cultivar Enrei. The database is provided with a GBrowse [41] with interactive pages for displaying the Enrei genome sequence as well all aligned Enrei BAC clones and accompanying annotations. DAIZUbase also includes a unified map, which indicates the relationship between the linkage map and the physical map of the Enrei cultivar.

Conclusion
The genome sequence of the Japanese cultivar Enrei will provide valuable information for improvement of soybean cultivars adapted to domestic cultivation. The genome sequence will complement emerging strategies for effective soybean breeding through analysis of the genome structure of Japanese (domestic) soybean, development of DNA markers serving as landmarks of agronomically important traits, development of research resources for the identification of important genes in soybean, and isolation of genes controlling important traits such as disease and pest resistance, productivity, and regional adaptability. Detailed knowledge of the genes controlling specific traits will allow for more efficient soybean improvement enabling researchers to develop plant types adaptable to various environmental conditions.