Research Article
Corresponding author: Shuzhen Wang (wangshuzhen710@whu.edu.cn)
Academic editor: Ilya Gavrilov-Zimin
© 2023 Zhiliang Li, Zhiwei Huang, Xuchun Wan, Jiaojun Yu, Hongjin Dong, Jialiang Zhang, Chunyu Zhang, Shuzhen Wang.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Li Z, Huang Z, Wan X, Yu J, Dong H, Zhang J, Zhang C, Wang S (2023) Complete chloroplast genome sequence of Rhododendron mariesii and comparative genomics of related species in the family Ericaeae. Comparative Cytogenetics 17: 163-180. https://doi.org/10.3897/compcytogen.17.101427
Rhododendron mariesii Hemsley et Wilson, 1907, a typical member of the family Ericaceae, has valuable medicinal and horticultural properties. In this research, the complete chloroplast (cp) genome of R. mariesii was sequenced and assembled; it showed a typical quadripartite structure with a length of 203,480 bp. The lengths of the large single-copy region (LSC), small single-copy region (SSC), and inverted repeat regions (IR) were 113,715 bp, 7,953 bp, and 40,918 bp, respectively. Among the 151 annotated genes, 98 were protein-coding genes, 45 were tRNA genes, and 8 were rRNA genes. The structural characteristics of the R. mariesii cp genome were similar to those of other angiosperms. Leucine was the most abundant amino acid, while cysteine was the least abundant. In total, 30 codons showed marked codon usage bias, and most were A/U-ending codons. Six highly variable regions were observed, such as trnK-pafI and atpE-rpoB, which could serve as potential markers for future barcoding and phylogenetic research on R. mariesii. Coding regions were more conserved than non-coding regions. Expansion and contraction of the IR region might be the main source of length variation in R. mariesii and related Ericaceae species. Maximum-likelihood (ML) phylogenetic analysis revealed that R. mariesii was closely related to R. simsii Planchon, 1853 and R. pulchrum Sweet, 1831. This research provides a rich genetic resource for R. mariesii and related species of the Ericaceae.
chloroplast genome, comparative genomics, conservation genetics, phylogeny, Rhododendron mariesii
Rhododendron mariesii Hemsley et Wilson, 1907, a typical member of the family Ericaceae, is mainly distributed in central China (
In higher plants, the majority of plastomes are circular and have a quadripartite architecture consisting of two inverted repeat regions (IRa and IRb), a large single-copy region (LSC), and a small single-copy region (SSC) (
Next-generation sequencing (NGS) has greatly increased the availability of genome data for non-model species, which facilitates comparative cp genomics and phylogenetic studies at the interspecific level (
Young and disease-free leaves of wild R. mariesii were sampled from the Dabie Mountains (central China, 29°16.13'N, 115°27.07'E, 1,005 m), dried in silica, and stored at -20 °C until use. Sample collection was authorized by the Biodiversity Conservation of Huanggang Normal University. The specimens were identified by Hongjin Dong (Huanggang Normal University), who holds a doctoral degree in botany. All materials are preserved in the Huanggang Normal University Herbarium (Hubei province, China). Total genomic DNA was extracted and purified from fresh leaves according to
A Nextera DNA library preparation kit was used to construct the paired-end Illumina libraries. These libraries were sequenced on an Illumina NovaSeq6000 Sequencing System (Illumina, Hayward, CA) in a paired-end run (500 cycles, 1 × 250 bp). After trimming adapter sequences and removing low-quality sequences, raw data were filtered with SOAPnuke software (Chen et al. 2018). Then, the high-quality reads were de novo assembled with the GetOrganelle pipeline (
Codon usage frequency was analyzed by CodonW software (https://sourceforge.net/projects/codonw/). Particularly, all protein coding genes were used for analysis. Relative synonymous codon usage (RSCU) analysis was carried out to measure codon usage bias (
In total, eleven full chloroplast genomes of the genus Rhododendron were downloaded from the NCBI database: R. molle Siebold et Zuccarini, 1846; R. griersonianum Balfour filius et Forrest, 1919; R. pulchrum (Sweet) George Don, 1834; R. henanense Fang, 1983; R. micranthum Maximowicz, 1870; R. delavayi Franchet, 1886; R. concinnum Hemsley, 1890; R. simsii Planchon, 1876; R. platypodum Diels, 1990; R. datiandingense Feng, 1996; and R. kawakamii Hayata, 1911. Unique genes of these ten downloaded and the newly assembled R. mariesii cp genomes were extracted with PHYLOSUITE v1.2.2 and aligned with the Windows version of MAFFT; nucleotide diversity (Pi) was then calculated for each unique gene with DNASP ver 6.12.03 (
MISA software (MicroSAtellite identification tool v2.1, http://pgrc.ipk-gatersleben.de/misa) was used to identify SSR motifs. The minimum number of tandem repeat units was set as follows: five repeat units for tri-, tetra-, penta-, and hexanucleotide SSRs; six repeat units for dinucleotide SSRs; and 10 repeat units for mononucleotide SSRs. The maximum number of bases interrupting two SSRs in a compound microsatellite was 100 bp.
By searching the NCBI database, 21 cp genomes of Ericaceae species were found and downloaded: 12 species of Rhododendron; two species of Vaccinium Linnaeus, 1753; Arbutus unedo Sims, 1822; Hemitomes congestum Asa Gray, 1858; Allotropa virgata Torrey et Gray, 1868; Monotropa hypopitys Linnaeus, 1753; Pityopus californicus (Eastwood) H.F.Copeland, 1935; and two species of Gaultheria Kalm, 1753. Together with the newly assembled R. mariesii cp genome, these 22 cp genomes were used to construct the phylogenetic tree. These cp genomes were initially aligned with MAFFT for phylogenetic analysis (
The structural characteristics of the cp genomes, comprising the newly assembled R. mariesii cp genome and 10 cp genomes of the genus Rhododendron (R. delavayi, R. henanense, R. micranthum, R. concinnum, R. griersonianum, R. simsii, R. kawakamii, R. molle, R. platypodum, and R. datiandingense), were compared and analyzed with the mVISTA online tool (using the Shuffle-LAGAN alignment program). The annotated cp genome of R. mariesii served as the reference against the other cp genomes. Genome alignments, including rearrangements and inversions, were examined with MAUVE (
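The repeat thresholds described above can be illustrated with a minimal Python sketch. This is not MISA's implementation: the function names, the regular-expression approach, and the toy sequence are assumptions made for illustration, and the 100 bp rule for merging neighbouring SSRs into compound microsatellites is not implemented here.

```python
import re

# Minimum repeat counts per motif length, as stated in the text:
# mono = 10, di = 6, tri- to hexanucleotide = 5.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """Return True if the motif is not itself a repeat of a shorter unit."""
    n = len(motif)
    return not any(
        n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n)
    )


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for k, min_rep in MIN_REPEATS.items():
        # One motif of length k followed by at least (min_rep - 1) copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip e.g. "AA" so that poly-A runs are only counted as mono.
            if is_primitive(motif):
                hits.append((match.start(), motif, len(match.group(0)) // k))
    return sorted(hits)


# Hypothetical toy sequence, not taken from the R. mariesii genome.
toy = "GC" + "A" * 12 + "CGC" + "AT" * 6 + "CC" + "AAG" * 5 + "C"
print(find_ssrs(toy))
```

Start positions are 0-based. A production analysis should rely on MISA itself, which also handles compound SSRs and reporting.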
In total, 19,498,900 reads were obtained from the NovaSeq paired-end run. After stringent quality assessment and filtering, 19,309,162 clean reads (2.891 Gb) with an average read length of 149 bp were retained. The percentage of clean reads was 99.03%, and the clean bases totaled 2,891,089,781 bp. The GC content was 39.52%. In addition, the Q20 (bases with a quality value greater than 20) and Q30 (bases with a quality value greater than 30) values were 97.28% and 92.34%, respectively. The size of the R. mariesii cp genome is 203,480 bp. Moreover, a typical quadripartite structure was observed, in which a large single-copy (LSC) region (113,715 bp) and a small single-copy (SSC) region (7,953 bp) were separated by a pair of identical inverted repeat regions (IRs) (40,918 bp) (Fig.
The chloroplast genome map of R. mariesii. Thick lines represent the LSC, SSC, and IR regions. Genes shown inside the circle are transcribed counterclockwise, and those outside the circle are transcribed clockwise. Different gene groups are represented by different colors.
In total, 151 genes were successfully annotated, including 98 protein-coding genes, 45 tRNA genes, and 8 rRNA genes. The lengths of CDS, rRNA, tRNA, intergenic regions, and introns were 65,889 bp (32.38%), 8,998 bp (4.42%), 3,449 bp (1.7%), 45,409 bp (22.32%), and 80,033 bp (39.33%), respectively. The GC contents of CDS, rRNA, tRNA, intergenic regions, and introns were 37.67%, 54.87%, 51.49%, 32.06%, and 33.75%, respectively. A set of 55 photosynthesis-related genes was found, comprising six subunits of ATP synthase (atpA, atpB, atpE, atpF, atpH, and atpI), seven subunits of photosystem I, 17 subunits of photosystem II, 17 subunits of NADH-dehydrogenase, seven subunits of cytochrome b/f complex, and one subunit of rubisco (rbcL) (Table
Gene content of the R. mariesii chloroplast genome. Copy numbers of duplicated genes are given in parentheses.
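The Q20 and Q30 values reported for the clean reads summarize per-base Phred quality scores, where a score Q corresponds to an error probability of 10^(-Q/10). The sketch below shows one way such fractions could be computed from FASTQ quality strings. It assumes Phred+33 encoding and counts a base when its score is at or above the threshold (the usual convention); the function names and example strings are hypothetical and are not part of the SOAPnuke workflow.

```python
def phred_scores(quality_line, offset=33):
    """Decode one FASTQ quality string (assumed Phred+33) into scores."""
    return [ord(ch) - offset for ch in quality_line]


def fraction_at_least(quality_lines, threshold):
    """Fraction of all bases whose Phred score is >= threshold."""
    total = 0
    passed = 0
    for line in quality_lines:
        for score in phred_scores(line):
            total += 1
            if score >= threshold:
                passed += 1
    return passed / total if total else 0.0


def error_probability(score):
    """Base-calling error probability implied by a Phred score."""
    return 10 ** (-score / 10)


# Two hypothetical quality strings: 'I' = 40, '?' = 30, '5' = 20, '+' = 10.
example = ["II5+", "?5II"]
print(fraction_at_least(example, 20), fraction_at_least(example, 30))
```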
Category of genes | Group of genes | Name of genes |
---|---|---|
Genes for photosynthesis | Subunits of ATP synthase | atpA, atpB, atpE, atpF, atpH, atpI |
Genes for photosynthesis | Subunits of photosystem I | psaA, psaB, psaC (2×), psaI (2×), psaJ |
Genes for photosynthesis | Subunits of photosystem II | psbA, psbB, psbC, psbD, psbE, psbF, psbH, psbI, psbJ (3×), psbK, psbL, psbM, psbN, psbT, psbZ |
Genes for photosynthesis | Subunits of NADH-dehydrogenase | ndhA (2×), ndhB, ndhC, ndhD (2×), ndhE (2×), ndhF, ndhG (2×), ndhH (2×), ndhI (2×), ndhJ, ndhK |
Genes for photosynthesis | Subunits of cytochrome b/f complex | petA (2×), petB, petD, petG, petL, petN |
Genes for photosynthesis | Subunit of rubisco | rbcL |
Self replication | Large subunit of ribosome | rpl2, rpl14, rpl16, rpl20, rpl22 (3×), rpl32 (2×), rpl33, rpl36 |
Self replication | DNA dependent RNA polymerase | rpoA, rpoB, rpoC1, rpoC2 |
Self replication | Translational initiation factor | infA |
Self replication | Ribosomal RNA genes | rrn5S (2×), rrn16S (4×), rrn23S (2×) |
Self replication | Transfer RNA genes | trnK-UUU, trnH-GUC, trnS-GGA, trnT-UGU, trnT-GGU, trnL-UAA, trnL-CAA, trnL-UAG (2×), trnM-CAU (12×), trnF-GAA, trnV-UAC, trnV-GAC (2×), trnR-UCU, trnR-ACG (2×), trnS-CGA, trnS-GCU, trnS-UGA, trnQ-UUG, trnW-CCA, trnP-UGG, trnC-GCA, trnD-GUC, trnY-GUA, trnE-UUC (2×), trnA-UGC (2×), trnN-GUU, trnA-UGC, trnE-UUC (2×) |
Self replication | Small subunit of ribosome | rps2, rps3 (3×), rps4, rps7, rps8, rps11, rps14, rps15 (3×), rps16, rps18 (2×), rps19 (3×) |
Other genes | Subunit of acetyl-CoA-carboxylase | accD |
Other genes | c-type cytochrome synthesis gene | ccsA (2×) |
Other genes | Envelope membrane protein | cemA (2×) |
Other genes | Maturase | matK |
Unknown function | Conserved open reading frames | ycf3, ycf4 (2×) |
Gene | Strand | Start | End | ExonI | IntronI | ExonII | IntronII | ExonIII |
---|---|---|---|---|---|---|---|---|
trnK-UUU | - | 1,834 | 4,404 | 37 | 2,499 | 35 | | |
ycf3 | - | 6,794 | 8,753 | 124 | 711 | 232 | 742 | 151 |
trnL-UAA | + | 11,313 | 11,909 | 35 | 512 | 50 | | |
trnV-UAC | - | 15,031 | 15,692 | 39 | 588 | 35 | | |
rpoB | + | 21,836 | 25,719 | 3,169 | 677 | 38 | | |
atpF | + | 35,554 | 36,820 | 161 | 700 | 406 | | |
trnS-CGA | - | 38,853 | 39,609 | 31 | 666 | 60 | | |
accD | + | 55,356 | 56,894 | 571 | 159 | 150 | 54 | 605 |
rpl16 | - | 59,253 | 167,784 | 9 | 108,121 | 402 | | |
ndhB | - | 101,387 | 103,550 | 721 | 685 | 758 | | |
trnE-UUC | + | 112,521 | 113,535 | 32 | 943 | 40 | | |
trnA-UGC | + | 113,600 | 114,490 | 37 | 818 | 36 | | |
ndhA | + | 126,079 | 128,272 | 563 | 1,090 | 541 | | |
ndhA | - | 181,938 | 184,131 | 563 | 1,090 | 541 | | |
trnA-UGC | - | 195,720 | 196,610 | 37 | 818 | 36 | | |
trnE-UUC | - | 196,675 | 197,689 | 32 | 943 | 40 | | |
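A simple consistency check for a table of this kind is that each gene's span (End - Start + 1) equals the sum of its exon and intron lengths. The sketch below applies that check to a few rows copied from the table; the dictionary layout and function names are assumptions made for illustration.

```python
# (start, end, [exon and intron lengths in table order]) for selected rows.
INTRON_GENES = {
    "trnK-UUU": (1834, 4404, [37, 2499, 35]),
    "ycf3": (6794, 8753, [124, 711, 232, 742, 151]),
    "rpoB": (21836, 25719, [3169, 677, 38]),
    "accD": (55356, 56894, [571, 159, 150, 54, 605]),
    "rpl16": (59253, 167784, [9, 108121, 402]),
}


def span_length(start, end):
    """Length of a feature given 1-based inclusive coordinates."""
    return end - start + 1


def is_consistent(start, end, parts):
    """True when exon and intron lengths add up to the annotated span."""
    return span_length(start, end) == sum(parts)


for gene, (start, end, parts) in INTRON_GENES.items():
    status = "ok" if is_consistent(start, end, parts) else "mismatch"
    print(gene, span_length(start, end), sum(parts), status)
```

The check assumes 1-based inclusive coordinates, which is what makes the listed rows add up.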
In the R. mariesii chloroplast genome, the protein-coding regions presented 40,013 codons (Table
Amino acid | Codon | No. | RSCU | Codon frequency per amino acid (%) | Amino acid | Codon | No. | RSCU | Codon frequency per amino acid (%) |
---|---|---|---|---|---|---|---|---|---|
Ala | GCA | 703 | 1.17 | 29.16 | Pro | CCA | 509 | 1.22 | 30.55 |
Ala | GCC | 345 | 0.57 | 14.31 | Pro | CCC | 294 | 0.71 | 17.65 |
Ala | GCG | 275 | 0.46 | 11.41 | Pro | CCG | 202 | 0.48 | 12.12 |
Ala | GCU | 1088 | 1.81 | 45.13 | Pro | CCU | 661 | 1.59 | 39.67 |
Cys | UGC | 118 | 0.5 | 24.99 | Gln | CAA | 1053 | 1.59 | 79.65 |
Cys | UGU | 354 | 1.5 | 74.97 | Gln | CAG | 269 | 0.41 | 20.35 |
Asp | GAC | 273 | 0.39 | 19.57 | Arg | AGA | 653 | 1.64 | 27.37 |
Asp | GAU | 1122 | 1.61 | 80.44 | Arg | AGG | 179 | 0.45 | 7.5 |
Glu | GAA | 1395 | 1.54 | 77.2 | Arg | CGA | 626 | 1.57 | 26.24 |
Glu | GAG | 412 | 0.46 | 22.8 | Arg | CGC | 147 | 0.37 | 6.16 |
Phe | UUC | 753 | 0.64 | 32.19 | Arg | CGG | 158 | 0.4 | 6.62 |
Phe | UUU | 1586 | 1.36 | 67.8 | Arg | CGU | 623 | 1.57 | 26.11 |
Gly | GGA | 1079 | 1.51 | 37.73 | Ser | AGC | 188 | 0.4 | 6.61 |
Gly | GGC | 327 | 0.46 | 11.43 | Ser | AGU | 580 | 1.22 | 20.41 |
Gly | GGG | 465 | 0.65 | 16.26 | Ser | UCA | 507 | 1.07 | 17.84 |
Gly | GGU | 989 | 1.38 | 34.58 | Ser | UCC | 415 | 0.88 | 14.6 |
His | CAC | 223 | 0.47 | 23.28 | Ser | UCG | 240 | 0.51 | 8.44 |
His | CAU | 735 | 1.53 | 76.73 | Ser | UCU | 912 | 1.93 | 32.09 |
Ile | AUA | 1125 | 0.94 | 31.34 | Thr | ACA | 640 | 1.22 | 30.39 |
Ile | AUC | 674 | 0.56 | 18.78 | Thr | ACC | 411 | 0.78 | 19.52 |
Ile | AUU | 1791 | 1.5 | 49.89 | Thr | ACG | 213 | 0.4 | 10.11 |
Lys | AAA | 1631 | 1.54 | 77.11 | Thr | ACU | 842 | 1.6 | 39.98 |
Lys | AAG | 484 | 0.46 | 22.88 | Val | GUA | 845 | 1.44 | 36.11 |
Leu | CUA | 518 | 0.74 | 12.36 | Val | GUC | 302 | 0.52 | 12.91 |
Leu | CUC | 251 | 0.36 | 5.99 | Val | GUG | 319 | 0.55 | 13.63 |
Leu | CUG | 248 | 0.35 | 5.92 | Val | GUU | 874 | 1.49 | 37.35 |
Leu | CUU | 889 | 1.27 | 21.21 | Trp | UGG | 748 | 1 | 100.02 |
Leu | UUA | 1475 | 2.11 | 35.18 | Tyr | UAC | 306 | 0.41 | 20.32 |
Leu | UUG | 811 | 1.16 | 19.35 | Tyr | UAU | 1200 | 1.59 | 79.68 |
Met | AUG | 954 | 1 | 100.01 | Stop* | UAA | 133 | 1.22 | 40.69 |
Asn | AAC | 382 | 0.46 | 22.78 | Stop* | UAG | 88 | 0.81 | 26.92 |
Asn | AAU | 1295 | 1.54 | 77.22 | Stop* | UGA | 106 | 0.97 | 32.42 |
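The RSCU column can be reproduced from the codon counts: for each amino acid, a codon's RSCU is its observed count divided by the mean count over all synonymous codons of that amino acid, so a value above 1 indicates a codon used more often than expected under equal usage. The following Python sketch recomputes a few values from the counts in the table; it illustrates the standard definition and is not the CodonW code, and the function name is an assumption.

```python
def rscu(counts):
    """Relative synonymous codon usage for one amino acid.

    counts maps each synonymous codon to its observed count.
    """
    mean = sum(counts.values()) / len(counts)
    return {codon: n / mean for codon, n in counts.items()}


# Counts copied from the codon usage table above.
cys = {"UGC": 118, "UGU": 354}
ala = {"GCA": 703, "GCC": 345, "GCG": 275, "GCU": 1088}
leu = {"CUA": 518, "CUC": 251, "CUG": 248, "CUU": 889, "UUA": 1475, "UUG": 811}

for name, table in (("Cys", cys), ("Ala", ala), ("Leu", leu)):
    print(name, {codon: round(value, 2) for codon, value in rscu(table).items()})
```

Rounded to two decimals, the results agree with the tabulated values, for example 0.5 and 1.5 for the two Cys codons and 2.11 for UUA.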
Nucleotide diversity analysis revealed sequence-level divergence among the Rhododendron cp genomes. Pi values for individual gene regions varied from 0 to 0.06896. The highest level of genetic variation was found in the SSC region (Pi = 0.01723), followed by the LSC (Pi = 0.00697) and IR (Pi = 0.001224) regions (Fig.
The nucleotide diversity (Pi) of 11 Rhododendron species chloroplast genomes. The X-axis shows the position in the aligned chloroplast genomes, and the Y-axis shows the Pi value. Below the X-axis, the large single-copy (LSC), small single-copy (SSC), and inverted repeat (IR) regions are indicated with arrow bars.
A set of 70 SSRs was identified in the R. mariesii cp genome, and 5 SSRs were present in compound formation. Of these, 65 SSRs (92.86%) were mononucleotide motifs, 2 were dinucleotide motifs (2.86%), 2 were trinucleotide motifs (2.86%), and 1 was a hexanucleotide repeat (1.43%) (Table
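Nucleotide diversity (Pi) is the average number of pairwise nucleotide differences per site among a set of aligned sequences. The sketch below computes it for a small alignment. It assumes that columns containing a gap or an ambiguous base in any sequence are excluded, which is a simplification chosen for this example and not necessarily the setting used in DNASP; the function name and toy alignments are hypothetical.

```python
from itertools import combinations


def nucleotide_diversity(aligned):
    """Mean pairwise differences per site over gap-free alignment columns."""
    if len(aligned) < 2:
        raise ValueError("need at least two sequences")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")

    valid = set("ACGT")
    # Keep only columns where every sequence has an unambiguous base.
    columns = [
        col
        for col in zip(*(seq.upper() for seq in aligned))
        if all(base in valid for base in col)
    ]
    if not columns:
        return 0.0

    pairs = list(combinations(range(len(aligned)), 2))
    differences = sum(col[i] != col[j] for col in columns for i, j in pairs)
    return differences / (len(pairs) * len(columns))


# Toy alignment: one variable site out of four, three sequences.
print(nucleotide_diversity(["ACGT", "ACGA", "ACGT"]))
```

Applied per gene or per sliding window along the alignment, a function like this yields the kind of profile plotted in the figure.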
Repeats | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | Total | Percentage |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A/T | - | - | - | - | - | 32 | 14 | 9 | 7 | 2 | | 64 | 91.429% |
C/G | - | - | - | - | - | | | | | | 1 | 1 | 1.429% |
AT/AT | - | | 2 | | | | | | | | | 2 | 2.857% |
AAG/CTT | 1 | | | | | | | | | | | 1 | 1.429% |
AAT/ATT | | | | | | | 1 | | | | | 1 | 1.429% |
AAGGGT/ACCCTT | 1 | | | | | | | | | | | 1 | 1.429% |
Mononucleotide A/T repeats with 10–14 repeat units were the most abundant, whereas the single (C/G)n microsatellite had 15 repeat units. Both dinucleotide SSRs had 7 repeat units. Among the trinucleotide motifs, the AAG/CTT and AAT/ATT microsatellites had 5 and 11 repeat units, respectively. The hexanucleotide motif AAGGGT/ACCCTT had 5 repeat units. In total, 34 SSRs were present in the intergenic spacer regions, accounting for 41.43%. Moreover, 28 SSRs were present in the rpl16 gene. The remaining 13 microsatellites were found in the ccsA, cemA, ndhA, rpoA, rpoC2, rps7, rps8, and trnL-UAA genes.
To clarify the phylogenetic position of R. mariesii within the Ericaceae, the complete plastomes of R. mariesii and 21 other Ericaceae species with fully sequenced chloroplast genomes were used to reconstruct phylogenetic relationships. The phylogenetic tree revealed that R. mariesii was closely related to R. simsii and R. pulchrum (Fig.
Structural characteristics of 11 Rhododendron cp genomes, comprising the newly assembled R. mariesii cp genome and 10 downloaded cp genomes of the genus Rhododendron, were investigated with mVISTA software. The annotated R. mariesii cp genome served as the reference. Relatively high similarity was detected among these 11 Rhododendron species. Coding regions were more conserved than non-coding regions (CNS in Fig.
Comparison of cp genomes with the R. mariesii annotation serving as the reference. The vertical scale indicates the percentage of identity (50–100%), and the horizontal axis shows the coordinates within the cp genome. The genome regions are color-coded as exons, introns, and conserved non-coding sequences.
The lengths of the IR regions of 6 cp genomes ranged from 14,194 bp (R. mariesii cp genome) to 47,467 bp (R. griersonianum cp genome) (Fig.
Comparison of the LSC, SSC, and IR boundaries of the cp genomes of R. mariesii and related taxa. JLB, JSB, JSA, and JLA represent the “junction line between LSC and IRb”, “junction line between IRb and SSC”, “junction line between SSC and IRa”, and “junction line between IRa and LSC”, respectively.
The chloroplast is the main organelle in which plants transform light energy into chemical energy (
The size of the R. mariesii cp genome (203,480 bp) was larger than that of R. pulchrum (146,941 bp), R. simsii (152,214 bp), R. molle (197,877 bp), R. delavayi (193,798 bp), and R. platypodum (201,047 bp), but smaller than that of R. kawakamii (230,777 bp), R. micranthum (207,233 bp), R. henanense (208,015 bp), R. griersonianum (206,467 bp), R. concinnum (207,236 bp), and R. datiandingense (207,311 bp). In total, 151 genes were present in the R. mariesii cp genome, more than in R. molle (149 genes) and R. pulchrum (73 genes) (
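The percent-identity profile in a plot of this kind can be approximated by sliding a fixed window along a pairwise alignment and counting matching positions. The sketch below is only an illustration of that idea. It is not the mVISTA or Shuffle-LAGAN algorithm; the window size, step, function name, and toy sequences are all assumptions.

```python
def window_identity(reference, other, window=100, step=100):
    """Percent identity per window for two equal-length aligned sequences.

    Gap characters ('-') never count as matches.
    Returns a list of (window_start, percent_identity) tuples.
    """
    if len(reference) != len(other):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, len(reference) - window + 1, step):
        ref_part = reference[start:start + window]
        other_part = other[start:start + window]
        matches = sum(a == b and a != "-" for a, b in zip(ref_part, other_part))
        results.append((start, 100.0 * matches / window))
    return results


# Toy aligned pair with one mismatch in the second window.
print(window_identity("ACGTACGTAC", "ACGTACGAAC", window=5, step=5))
```

Windows that fall below a chosen identity cutoff would correspond to the more divergent, mostly non-coding, stretches of the comparison.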
Besides genes involved in photosynthesis transforming light energy into chemical energy, other genes also existed in R. mariesii cp genome. For example, accD gene, encoding plastid beta carboxyl transferase subunit of acetyl-CoA carboxylase (ACCase) important for plant growth (leaf growth, leaf longevity, fatty acid biosynthesis, and embryo development), has been reported to be involved in the adaptation to specific ecological niches during radiation of dicotyledonous plants (
A total of 70 SSRs were identified in the R. mariesii cp genome, more than in Myracrodruon urundeuva (36 SSRs), Spondias bahiensis P. Carvalho, 2015 (53 SSRs), and Mangifera indica Wallich, 1847 (57 SSRs), but fewer than in Syringa pinnatifolia Hemsley, 1906 (253 SSRs) (
Non-coding regions often mutate relatively faster than coding regions (
This research aimed to expand the molecular genetic resources available for R. mariesii through high-throughput sequencing and cp genome assembly. The R. mariesii cp genome sequence could be used to distinguish species and resolve phylogenetic relationships within the Ericaceae. Moreover, this research will support further genetic analysis of R. mariesii and other species in the family Ericaceae.
The authors declare that they have no competing interests.
The cp genome of R. mariesii was submitted to the GenBank database under accession number OM161981.
This work was supported by a grant from the Scientific and Technological Research Project of Hubei Provincial Department of Education (B2022204) and the Open Fund of Hubei Key Laboratory of Economic Forest Germplasm Improvement and Resources Comprehensive Utilization (202303202).
Taxonomic and accession information on cp genomes downloaded from NCBI database
Data type: wps