Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan.
Laboratory of Plant Molecular Genetics, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo, Japan.
PLoS Comput Biol. 2021 Jan 12;17(1):e1008597. doi: 10.1371/journal.pcbi.1008597. eCollection 2021 Jan.
Plant mitochondrial genomes have distinctive features compared to those of animals; namely, they are large and divergent, with sizes ranging from hundreds of thousands to a few million bases. Recombination among repetitive regions is thought to produce similar structures that differ slightly, known as "multipartite structures," which contribute to different phenotypes. Although many reference plant mitochondrial genomes represent almost all the genes in mitochondria, the full spectrum of their structures remains largely unknown. The emergence of long-read sequencing technology is expected to reveal this landscape; however, many studies have aimed to assemble only one representative circular genome, because resolving multipartite structures with existing assemblers is not feasible. To elucidate multipartite structures, we leveraged the information in existing reference genomes and classified long reads according to their corresponding structures. We developed a method that exploits two classic algorithms, partial order alignment (POA) and the hidden Markov model (HMM), to construct a sensitive read classifier. This method enables us to represent a set of reads as a POA graph and analyze it using the HMM. We can then calculate the likelihood of a read occurring in a given cluster, which yields an iterative clustering algorithm. For synthetic data, our proposed method reliably detected a single variant site in 9,000-bp synthetic long reads with a 15% sequencing-error rate and produced accurate clustering. It was also capable of clustering long reads from six very similar sequences containing only slight differences. For real data, we assembled putative multipartite structures of mitochondrial genomes of Arabidopsis thaliana from nine accessions sequenced using PacBio Sequel. The results indicated that there are recurrent and strain-specific structures in A. thaliana mitochondrial genomes.
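The likelihood-driven iterative clustering described in the abstract can be illustrated with a heavily simplified sketch. This is not the authors' implementation. It assumes equal-length, already aligned reads with substitution errors only, and it replaces the POA graph and HMM with a per-column base-frequency profile. All function names (`fit_profile`, `log_likelihood`, `initial_labels`, `cluster_reads`), the pseudocount, and the farthest-first seeding are hypothetical choices made for illustration.

```python
import math

ALPHABET = "ACGT"


def hamming(a, b):
    """Number of mismatching positions between two equal-length reads."""
    return sum(x != y for x, y in zip(a, b))


def fit_profile(reads, length, pseudocount=1.0):
    """Per-column log base frequencies.

    Assumption: this column-wise model is only a crude stand-in for
    a POA graph analyzed with an HMM; it cannot handle indels.
    """
    profile = []
    for pos in range(length):
        counts = {base: pseudocount for base in ALPHABET}
        for read in reads:
            counts[read[pos]] += 1
        total = sum(counts.values())
        profile.append({base: math.log(c / total) for base, c in counts.items()})
    return profile


def log_likelihood(read, profile):
    """Log-probability of a read under one cluster's column-wise model."""
    return sum(column[base] for base, column in zip(read, profile))


def initial_labels(reads, k):
    """Farthest-first seeding (an assumed heuristic), then nearest-seed assignment."""
    seeds = [0]
    while len(seeds) < k:
        seeds.append(max(range(len(reads)),
                         key=lambda i: min(hamming(reads[i], reads[s]) for s in seeds)))
    return [min(range(k), key=lambda c: hamming(read, reads[seeds[c]]))
            for read in reads]


def cluster_reads(reads, k, max_iter=50):
    """Alternate between refitting cluster models and reassigning reads until stable."""
    length = len(reads[0])
    labels = initial_labels(reads, k)
    for _ in range(max_iter):
        # Refit one model per cluster from its current members.
        profiles = [fit_profile([r for r, lab in zip(reads, labels) if lab == c], length)
                    for c in range(k)]
        # Move each read to the cluster under which it is most likely.
        new_labels = [max(range(k), key=lambda c: log_likelihood(read, profiles[c]))
                      for read in reads]
        if new_labels == labels:
            break
        labels = new_labels
    return labels


if __name__ == "__main__":
    toy_reads = [
        "ACGTACGTAC", "ACGTACGTAC", "ACGAACGTAC",  # structure A, one read with an error
        "ACGTTTGTAC", "ACGTTTGTAC", "ACGTTTGTCC",  # structure B, one read with an error
    ]
    print(cluster_reads(toy_reads, k=2))
```

On the six toy reads, the first three and the last three end up in separate clusters (`[0, 0, 0, 1, 1, 1]`). Real long reads contain insertions and deletions and differ in length, which is the case a graph-based representation such as POA is meant to handle and this column-wise sketch is not.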