Armijos Carrion Angelo D, Hinsinger Damien D, Strijk Joeri S
Biodiversity Genomics Team, Plant Ecophysiology & Evolution Group, Guangxi Key Laboratory of Forest Ecology and Conservation, College of Forestry, Guangxi University, Nanning, Guangxi, PR China.
Alliance for Conservation Tree Genomics, Pha Tad Ke Botanical Garden, Luang Prabang, Laos.
PeerJ. 2020 Apr 7;8:e8699. doi: 10.7717/peerj.8699. eCollection 2020.
With the rapid increase in availability of genomic resources offered by Next-Generation Sequencing (NGS) and the availability of free online genomic databases, efficient and standardized metadata curation approaches have become increasingly critical for the post-processing stages of biological data. Especially in organelle-based studies using circular chloroplast genome datasets, the assembly of the main structural regions in random order and orientation represents a major limitation in our ability to easily generate "ready-to-align" datasets for phylogenetic reconstruction, at both small and large taxonomic scales. In addition, current practices discard the most variable regions of the genomes to facilitate the alignment of the remaining coding regions. Nevertheless, no software is currently available to perform curation to such a degree through simple detection, organization and positioning of the main plastome regions, which makes curation a time-consuming and error-prone process. Here we introduce ECuADOR, a fast and user-friendly Perl script specifically designed to automate the detection and reorganization of newly assembled plastomes obtained from any available source (NGS, Sanger sequencing or assembler output).
ECuADOR uses a sliding-window approach to detect long repeated sequences in draft sequences and, from these, identifies the inverted repeat regions (IRs), even in the case of artifactual breaks or sequencing errors. It then automates the rearrangement of the sequence into the widely used LSC-IRb-SSC-IRa order. This facilitates rapid post-editing steps such as creation of genome alignments, detection of variable regions, SNP detection and phylogenomic analyses.
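The detection and reordering idea can be illustrated with a minimal Python sketch. This is not the tool's own Perl implementation: the function names, the exact-match seeding, and the default window size of 12 are assumptions made for illustration. It also assumes an upper-case A/C/G/T sequence with error-free repeats whose arms do not span the start of the linear draft, and it only rotates the sequence without standardizing the orientation of each region.

```python
PAIR = {"A": "T", "C": "G", "G": "C", "T": "A"}


def revcomp(seq):
    """Reverse complement of an upper-case A/C/G/T string."""
    return "".join(PAIR[base] for base in reversed(seq))


def find_inverted_repeat(seq, window=12):
    """Return (a_start, a_end, b_start, b_end) for the longest inverted repeat,
    where seq[a_start:a_end] == revcomp(seq[b_start:b_end]), or None."""
    n = len(seq)
    # Index every window of the draft by its sequence.
    index = {}
    for pos in range(n - window + 1):
        index.setdefault(seq[pos:pos + window], []).append(pos)

    best = None
    for i in range(n - window + 1):
        # A window whose reverse complement occurs downstream seeds a repeat.
        for j in index.get(revcomp(seq[i:i + window]), ()):
            if j < i + window:
                continue  # keep the left arm strictly before the right arm
            if i > 0 and j + window < n and PAIR[seq[i - 1]] == seq[j + window]:
                continue  # not the outermost seed; an earlier window covers it
            # Extend inward: the left arm grows right, the right arm grows left.
            a_end, b_start = i + window, j
            while a_end < b_start - 1 and PAIR[seq[a_end]] == seq[b_start - 1]:
                a_end += 1
                b_start -= 1
            if best is None or a_end - i > best[1] - best[0]:
                best = (i, a_end, b_start, j + window)
    return best


def reorder_plastome(seq, window=12):
    """Rotate a circular draft so it reads LSC, IR arm, SSC, IR arm."""
    hit = find_inverted_repeat(seq, window)
    if hit is None:
        return seq  # no repeat found; leave the draft untouched
    a_start, a_end, b_start, b_end = hit
    between = b_start - a_end                # single-copy region between the arms
    outside = len(seq) - (b_end - a_start)   # single-copy region spanning the origin
    # The longer single-copy region is taken as the LSC and placed first.
    cut = a_end if between >= outside else b_end
    return seq[cut:] + seq[:cut]
```

Calling `reorder_plastome(draft)` on a draft that starts at an arbitrary point of the circle returns the same circle rotated to begin with the longer single-copy region. Tolerating mismatches, breaks and orientation changes, as described above, would require approximate matching rather than the exact seeds used here.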
ECuADOR was successfully tested on plant families throughout the angiosperm phylogeny by curating 161 chloroplast datasets. It first identified and reordered the central regions (LSC-IRb-SSC-IRa) for each dataset and then produced a new annotation for the chloroplast sequences. The process took less than 20 min with a maximum memory requirement of 150 MB and an accuracy of over 99%.
ECuADOR is the sole de novo, one-step recognition and reordering tool that facilitates the post-processing analysis of extranuclear genomes from NGS data. The program is available at https://github.com/BiodivGenomic/ECuADOR/.