﻿Complete chloroplast genome sequence of Rhododendronmariesii and comparative genomics of related species in the family Ericaeae

﻿Abstract Rhododendronmariesii Hemsley et Wilson, 1907, a typical member of the family Ericaeae, possesses valuable medicinal and horticultural properties. In this research, the complete chloroplast (cp) genome of R.mariesii was sequenced and assembled, which proved to be a typical quadripartite structure with the length of 203,480 bp. In particular, the lengths of the large single copy region (LSC), small single copy region (SSC), and inverted repeat regions (IR) were 113,715 bp, 7,953 bp, and 40,918 bp, respectively. Among the 151 unique genes, 98 were protein-coding genes, 8 were tRNA genes, and 45 were rRNA genes. The structural characteristics of the R.mariesiicp genome was similar to other angiosperms. Leucine was the most representative amino acid, while cysteine was the lowest representative. Totally, 30 codons showed obvious codon usage bias, and most were A/U-ending codons. Six highly variable regions were observed, such as trnK-pafI and atpE-rpoB, which could serve as potential markers for future barcoding and phylogenetic research of R.mariesii species. Coding regions were more conserved than non-coding regions. Expansion and contraction in the IR region might be the main length variation in R.mariesii and related Ericaeae species. Maximum-likelihood (ML) phylogenetic analysis revealed that R.mariesii was relatively closed to the R.simsii Planchon, 1853 and R.pulchrum Sweet,1831. This research will supply rich genetic resource for R.mariesii and related species of the Ericaeae.


Introduction
Rhododendron mariesii Hemsley et Wilson, 1907, a typical member of the family Ericaeae, is mainly distributed in central China . Well known for leaf shape and bright-colored flowers, R. mariesii possesses valuable medicinal and horticultural properties . The deciduous species R. mariesii attracted great interest of Rhododendron breeders and geneticists. Furthermore, R. mariesii is very important in the woodland flora of the Dabie Mountains, and plays critical roles in ecological balance . Recently, Rhododendron-based ecological tourism, habitat fragmentation, and human activities have exerted significant effects towards natural growth of the wild Rhododendron population (Wang et al. 2019). Therefore, the research on population genetics and ecological conservation of wild R. mariesii is vital and necessary. However, limited genome information is available for R. mariesii, which has largely hindered corresponding genetic and molecular research.
In higher plants, the majority of plastomes are circular and quadripartite architecture consisting of two inverted repeat regions (IRa and IRb), a large single-copy region (LSC), and a small single-copy region (SSC) (Daniell et al. 2016;Hu et al. 2020;Abdullah et al. 2021). As maternally inherited organelle, the angiosperm plastome has a relatively conserved gene content and stable structure, which offers genetic markers sufficient for genome-wide evolutionary investigation at various taxonomic levels (Asaf et al. 2016;Zhang et al. 2017;Givnish et al. 2018). In plants, the size of cp genomes varied from 107 to 280 kb, containing approximately 130 genes related to photo synthesis and carbon fixation (Daniell et al. 2016;Rossini et al. 2021). The substitution rate of cp genome is lower than that of the nuclear genome, and 115-165 kb in cp genome is highly evolutionarily conserved (Smith 2015). However, specific genes exhibit accelerated evolution rates, such as ycf1, matK, and rbcL, which often serve as DNA barcoding (Dong et al. 2015;Wambugu et al. 2015;Zhang et al. 2017).
Next generation sequencing (NGS) has greatly increased the availability of genome data for non-species model, which facilitates the comparative cp genomics and phylogenetic studies at interspecific level (Santos and Almeida 2019;Pervez et al. 2022). In this research, the cpDNA of R. mariesii was assembled and annotated, SSR loci were characterized, comparative genomics and phylogenetic studies were also performed, hoping to benefit the studies of population evolution and conservation genetics of R. mariesii and related species.

Materials sampling and DNA extraction
Young and disease-free leaves of wild R. mariesii were sampled from the Dabie Mountains (central China,29°16.13'N,115°27.07'E,1,005 m), dried in silica, and stored at -20 °C until further usage. In particular, sample collection was authorized by the Biodiversity Conservation of Huanggang Normal University. The specimens were identified by Hongjin Dong (Huanggang Normal University), who possesses a doctoral degree in botany. All materials were well conserved in the Huanggang Normal University Herbarium (Hubei province, China). Total genomic DNA was extracted and purified from fresh leaves according to Wang et al. (2019). Subsequently, the quality of total genome DNA was verified in 1% agarose gel stained by GelRed and quantified by spectophotometer (NanoDrop 1000, Thermofisher Scientific, USA).

Genome sequencing, assembly, and annotation
Nextera DNA library preparation kit was used to construct the paired-end Illumina libraries. These libraries were sequenced on Illumina NovaSeq6000 Sequencing System (Illumina, Hayward, CA) in a paired-end run (500 cycles, 1 × 250 pb). After trimming adapter sequences and removing low-quality sequences, raw data was filtered by SOAPnuke software (Chen et al. 2018). Then, the high-quality reads were de novo assembled by GetOrganelle pipeline . BOWTIE2 were used to validate the assembled sequence error of R. mariesii cp genome through mapping raw sequencing reads to the assembled plastome (Hanussek et al. 2021). Online program Organelle Genome DRAW (OGDRAW) was used to draw the physical map of R. mariesiicp genome (Greiner et al. 2019). Furthermore, gene annotation and analysis were carried out with DOGMA and C P GAVAS softwares, respectively (Liu et al. 2012). The final annotations were also manually verified by Geneious (ver.8.0.2) (Yu et al. 2022). The cp genome data had been submitted to the National Center for Biotechnology Information (NCBI) database (https://www.ncbi.nlm.nih.gov/).

Codon usage and nucleotide diversity analysis
Codon usage frequency was analyzed by CodonW software (https://sourceforge.net/ projects/codonw/). Particularly, all protein coding genes were used for analysis. Relative synonymous codon usage (RSCU) analysis was carried out to measure codon usage bias (Rossini et al. 2021). The RSCU referred to the ratio of observed frequency of codons to frequency expected in regarding to the equal usage of synonymous codons for a certain amino acid (Rossini et al. 2021). In particular, RSCU value more than 1 means a preferred codon, otherwise the value less than and equal to 1 are considered as no codon usage bias (Morton 2022 Diels, 1990;R. datiandingense Feng, 1996;and R. kawakamii Hayata, 1911. Unique genes of these ten downloaded and the newly assemble R. mariesii cp genomes were extracted with PHYLOSUITE v1.2.2 and aligned by Windows version of MAFFT software, then nucleotide diversity (Pi) was calculated for each unique gene with DNASP ver 6.12.03 (Rossini et al. 2021).

Simple sequence repeats (SSR) analysis
MISA software (MicroSAtellite identification tool v2.1, http://pgrc.ipk-gatersleben.de/ misa) was used to identify SSR motifs. Minimum number of tandem repeat units were set as follows: five repeat units for tri-, tetra-, penta-, and hexanucleotide SSRs; six repeat units for di-nucleotide SSRs; 10 repeat units for mono-nucleotide SSRs. The maximal number of bases interrupting two SSRs in a compound microsatellite was 100 bp.

Phylogenetic analysis
Through searing NCBI database, 21 cp genomes of Ericaceae species were found and downloaded: 12 species of Rhododendron; two species of Vaccinium Linnaeus, 1753; Arbutus unedo Sims, 1822; Hemitomes congestum Asa Gray, 1858; Allotropa virgata Torrey et Gray, 1868, Monotropa hypopitys Linnaeus, 1753; Pityopus californicus (Eastwood) H.F.Copeland, 1935; and 2 species of Gaultheria Kalm, 1753. Together with the newly assembled R. mariesii cp genome, these 22 cp genomes were used to construct phylogeny tree. These cp genomes were initially aligned with MAFFT for phylogenetic analysis . RAxML (version 8.2.8 for Windows) was used to run maximum likelihood (ML) analysis with a bootstrap value of 1000 (Alexandros 2014). FIGTREE v1.4 was used to visualize and adjust the ML trees (Yu et al. 2022). In particular, cp genome of Pyrola rotundifolia Benth. (1840) played the roles of an out-group.

Comparative analysis of genome structure
The structural characteristics of cp genomes, containing newly assembled R. mariesii and 10 cp genomes of the genus Rhododendron (R. delavayi, R. henanense, R. micranthum, R. concinnum, R. griersonianum, R. simsii, R. kawakamii, R. molle, R. platypodum, and R. datiandingense) were compared and analyzed with mVISTA online tool (using Shuffle-LAGAN alignment program). In particular, the annotated cp genome of R. mariesii served as a reference against the other cp genome. Genome alignments, including rearrangements or inversions, was detected with MAUVE (Darling et al. 2004). For investigating whether expansion or contraction occurred in R. mariesii cp genome, the IR/LSC and IR/ SSC junction regions were compared with IRscope software (Amiryousefi et al. 2018).

General features of R. mariesii chloroplast genome
In total, 19,498,900 reads were obtained from NovaSeq paired-end run. After stringent quality assessment and filtering, 19,309,162 clean reads (2.891 Gb) with an average of 149 bp read length were obtained. The percentage of clean reads was 99.03%, and the clean bases were 2,891,089,781 bp. In particular, GC content was 39.52%.
In addition, Q20 (a base with quality value greater than 20) and Q30 (a base with quality value greater than 20) values were 97.28% and 92.34%, respectively. The size of R. mariesii cp genome is 203,480 bp. Moreover, typical quadripartite structure was observed, as a large single-copy (LSC) region (113,715 bp) and a small single-copy (SSC) region (7,953 bp) were separated by a pair identical inverted repeat regions (IRs) (40,918 bp) (Fig. 1).

Codon usage analysis and nucleotide diversity analysis
In the R. mariesii chloroplast genome, the protein-coding regions presented 40,013 codons (Table 3). Particularly, leucine (Leu) was the main amino acid (10.477%), followed by isoleucine (Ile, 8.972%) and glycine (Gly, 7.148%) (Fig. 2). In particular, cysteine (Cys) and tryptophan (Trp) were the lowest representative amino acids, accounting for 1.180% and 1.869%, respectively. According to RSCU values, a total of 30 codons showed obvious codon usage bias, as RSCU value were more than 1 (Table 3). Except Leu codon (UUG), all the other 29 codons were A/U-ending. For the 34 codons with RSCU values less than 1, 31 were C/G-ending, while 3 were A/U-ending. Nucleotide diversity analysis showed that sequence level of divergence existed between different Rhododendron cp genomes. Pi values for each gene region varied from 0 to 0.06896. High level of genetic variation mainly existed in SSC region (Pi =

Phylogenetic analysis
For clarifying the phylogenetic location of R. mariesii among the Ericaeae, complete plastomes of R. mariesii and other 21 species in the Ericaeae with fully sequenced  chloroplast genomes were used in reconstructing phylogenetic relationships. The phylogenetic tree revealed that R. mariesii had a close genetic relationship with R. simsii and R. pulchrum (Fig. 4). In particular, all these 22 taxa belonging to Ericaeae were grouped into one clade and clustered into two subclades. Topological structure was almost consistent with the previously published phylogeny (Liu et al. 2021

Comparative plastome sequence divergence and hotspot regions
Structural characteristics of 11 Rhododendron cp genomes were investigated with mVISTA software, containing the newly assembled R. mariesii cp genome and 10 download cp genome of the Rhododendron genus. In particular, the annotated R. mariesii cp genome served as a reference. Relatively high similarity was detected among these 11 Rhododendron species. Coding regions were more conserved than non-coding regions (CNS in Fig. 5). The LSC and SSC regions were relatively more stable than IR regions. Among these coding regions, rpoB, rpoC2, rps8, petD, rpl23, rpl22, and ndhF were relatively divergent because of intron regions. In R. mariesii plastid genome, highly variable regions mostly existed in the intergenic spacer, such as trnK-pafI, atpE-rpoB, trnT-rpl14, rpoA-psbJ, rpl20-trnE, ndhI-rps19, and rpl16-ndhI. Compared with intergenic spacer, protein coding regions were highly conserved, such as rps4, ndhJ, ndhK, rpcL, rps2, atpI, psaA, psaB, psbB, cemA, and petA. No rearrangements and inversions occurred in these 11 cp genomes of Rhododendron species. Particularly, lengths of the IR regions of 6 cp genomes ranged from 14,194 bp (R. mariesii cp genome) to 47,467 (R. griersonianum cp genome) (Fig. 6). Expansion and contraction existed in these cp genomes. In R. griersonianum cp genome, JLB line (line between LSC and IRb) was located between genes ycf15 and trnR, while ycf15 was located in the LSC region with 166 bp extending to IRb region. In R. mariesii cp genome, JLB line was located between trnV and rrn16 (476 bp extending to LSC region). In R. micranthum and R. henanense cp genomes, the JLB lines were located between Figure 5. Comparison of cp genomes with R. mariesii annotation serving as the reference. Vertical scale indicated the percentage of identity (50-100%), and horizontal axis was coordinates within cp genome. The genome regions were color-coded as exons, introns, and conserved non-coding sequences, respectively. rps12 and trnV, while trnV was located in IRb region with 913 bp and 935 bp extending to LSC region, respectively. However, JLB line was located between rps7 and trnI, and trnI was located in IRb region with 911 bp extending to LSC region in R. concinnum. Except for R. mariesii, ndhF was located in SSC region with 296 bp-314 bp extending to IRb region in the five other cp genomes (Fig. 6). Meanwhile, rps15 was located in SSC region of R. mariesii cp genome. The JSA line (line between SSC and IRa) was located between ndhF and rpl32, and ndhF was distributed in SSC region with 54 bp, 53 bp, 67 bp, 54 bp, and 37 bp to IRa region in R. griersonianum, R. concinnum, R. micranthum, R. henanense, and R. delavayi cp genomes, respectively. Besides ndhH, rps15 was also located in SSC region with 114 bp extending to IRa region in R. mariesii cp genome. Furthermore, JLA line (line between IRa and LSC) was located between trnV and psbA in R. concinnum, R. micranthum, and R. henanense. However, JLA line was located between trnR and trnH in R. griersonianum cp genome. In R. mariesii cp genome, rrn16 and trnV were located besides the JLA line.

Discussion
The chloroplast genome is the main organelle for plant transforming light energy into chemical energy (Zhang et al. 2018). Plastome genome is useful for comparative genomic research and phylogenomic analyses due to polymorphic regions generated through genomic expansion, inversion, contraction, and gene rearrangement (Sanitá Figure 6. The comparison of LSC, SSC, and IR regional boundaries of cp genome between R. mariesii and related taxa. JLB, JSB, JSA, and JLA respected "junction line between LSC and IRb", "junction line between IRb and SSC", "junction line between SSC and IRa" , as well as "junction line between IRa and LSC", respectively. Lima et al. 2016;Kahraman et al. 2019;Li et al. 2019;Liu et al. 2020;Wang et al. 2020). The single circular cp genome structure of R. mariesii was the same as other species belonging to the Ericaeae with a typical quadripartite structure and similar GC content unevenly distributed across the cp genome (Liu et al. 2021;Xu et al. 2022). The GC content of R. mariesii cp genome (39.52%) was slightly larger than that of Myracrodruon urundeuva Allemão, 1862 (37.8%) (Rossini et al. 2021). Relative to both LSC (35.85%) and SSC (36.49%) regions, the GC content in IR region (30.48%) is lower. However, the GC content in IR region was larger than LSC and SSC regions in cp genome Xanthium spinosum Linnaeus, 1753 (Raman et al. 2020). The length of total genome size and each region were similar to other plant cp genomes, such as R. molle, Rubus species (Rosaceae), and rubber dandelion (Taraxacum kok-saghyz Rodin) (Zhang et al. 2017;Xu et al. 2022;Yu et al. 2022).
Besides genes involved in photosynthesis transforming light energy into chemical energy, other genes also existed in R. mariesii cp genome. For example, accD gene, encoding plastid beta carboxyl transferase subunit of acetyl-CoA carboxylase (ACCase) important for plant growth (leaf growth, leaf longevity, fatty acid biosynthesis, and embryo development), has been reported to be involved in the adaptation to specific ecological niches during radiation of dicotyledonous plants (Hu et al. 2015). In R. mariesii cp genome, one copy of accD gene was also found. Codons coding Leu (10.477%), Ile (8.972%), and Gly (7.148%) were dominant, while Cys (1.180%) and Trp (1.869%) were the least, which were the same as that of M. urundeuva cp genome (Rossini et al. 2021). Codon bias, an efficient mechanism of translation influenced by natural selection and mutation pressure, takes place if synonymous codons are used at different frequencies (Zhang et al. 2022). A total of 30 codons showed codon usage bias, and most were A/U-ending codons, which were the same as that observed in M. urundeuva and Solanum (Zhang et al. 2018a;Rossini et al. 2021). In total, six gene regions showed high levels of nucleotide diversity (Pi > 0.02), containing trnI-GAU, trnG-UCC, rps3, rps12, trnV-UAC, and trnK-UUU, serving as the first candidate for developing molecular markers to identify Rhododendron species.
A total of 70 SSRs were identified from R. mariesii cp genome, more than that of M. urundeuva (36 SSRs), Spondias bahiensis P. Carvalho, 2015 (53 SSRs) and Mangifera indica Wallich, 1847 (57 SSRs), but fewer than that of Syringa pinnatifolias Hemsley, 1906 (253 SSRs) (Jo et al. 2017;Santos and Almeida 2019). Variation in the number and type of microsatellites might play important roles in plastome organization. The main motifs were A/T repeats (91.429%), which was the same with that of M. urundeuva, S. bahiensis, and M. indica (Jo et al. 2017;Santos and Almeida 2019;Rossini et al. 2021). However, no correlation was found between large repeat regions and rearrangement endpoints, which was similar with Liu et al. (2013). Very limited tandem (G/C)n-containing microsatellites were observed, which might be due to the low content of G and C bases in chloroplast genome. Molecular markers developed for the intergenic regions could be used for phylogenetic, phylogeographic, and barcoding studies of Rhododendron species.
Non-coding regions often mutate relatively faster than coding regions (Yu et al. 2022). In R. mariesii cp genome, coding regions were more conserved than the non-coding regions. Relatively high similarity was detected among these Rhododendron cp genomes, but expansion and contraction also existed in IR regions, which might be the dominant reason for variation in cp genome size. Obvious differences were found in cp IR boundary regions, containing gene contents and locations. However, IR regions were least divergent, which were mainly due to the presence of four highly conserved rRNA sequences in X. spinosum (Raman et al. 2020). Furthermore, LSC and SSC regions were relatively more stable than IR regions in R. mariesii cp genome. These genetic variations may significantly facilitate R. mariesii adapting to the changes of survival conditions. According to neutral theory, nucleotide substitution in non-coding regions (intergenic spacer, intron region, and pseudogenes) are neutral or near-neutral, which could not be affected by natural selection (Akashi et al. 2012). Therefore, evolutionary history of R. mariesii could be well calculated from the rate of molecular evolution in non-coding region.
This research aimed to expand the molecular genetic resources available for R. mariesii through high-throughput sequencing and cp genome assembly. The R. mariesii cp genome sequence could be used in distinguishing and resolving phylogenetic relationships within Ericaeae lineage. Moreover, this research will be vital for further genetic analysis on R. mariesii and other species in the Ericaeae family.

Conflicts of interests
The authors declare that they have no competing interests.

Data availability statement
The cp genome of R. mariesii was submitted to GenBank database under the accession number of OM161981.