Department of Mathematical Sciences, Tsinghua University, Beijing, China.
Beijing Electronic Science and Technology Institute, Beijing, China.
PeerJ. 2022 Jun 16;10:e13544. doi: 10.7717/peerj.13544. eCollection 2022.
Characterizing and comparing microbial sequences, including those of archaea, bacteria, viruses and fungi, is important for understanding their evolutionary origins and population relationships. Most existing methods are limited by sequence length and lack generality. This study proposes a general characterization method and uses it to study the classification and phylogeny of existing datasets.
We present a new alignment-free method for representing and comparing biological sequences. By adding the covariance between each pair of nucleotides, the new 18-dimensional natural vector successfully describes 24,250 genomic sequences and 95,542 DNA barcode sequences. This numerical representation is then used to study the classification and phylogenetic relationships of microbial sequences.
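As a rough illustration of how such a vector might be assembled, the sketch below builds an 18-dimensional representation from a DNA string. It assumes the usual 12-dimensional natural vector layout (count, mean position and normalized second central moment for each of A, C, G, T) and appends six pairwise terms, one per nucleotide pair. The abstract does not give the covariance formula, so the pairing rule used here (matching the i-th occurrence of one nucleotide with the i-th occurrence of the other) is purely an assumption for illustration, and the function name `natural_vector_18` is hypothetical.

```python
from itertools import combinations

NUCLEOTIDES = "ACGT"


def natural_vector_18(seq):
    """Return a list of 18 floats: 4 counts, 4 mean positions,
    4 normalized second central moments, 6 pairwise covariance terms."""
    seq = seq.upper()
    n = len(seq)  # total length, including any non-ACGT symbols
    # 1-based positions of each nucleotide in the sequence
    positions = {k: [i + 1 for i, c in enumerate(seq) if c == k]
                 for k in NUCLEOTIDES}

    counts, means, moments = [], [], []
    for k in NUCLEOTIDES:
        pos = positions[k]
        n_k = len(pos)
        mu_k = sum(pos) / n_k if n_k else 0.0
        # second central moment, normalized by n_k * n
        d2_k = sum((p - mu_k) ** 2 for p in pos) / (n_k * n) if n_k else 0.0
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(d2_k)

    mu = dict(zip(NUCLEOTIDES, means))
    covariances = []
    # ASSUMPTION: the pairing rule below is illustrative only and is not
    # taken from the paper. Pair order: AC, AG, AT, CG, CT, GT.
    for k, l in combinations(NUCLEOTIDES, 2):
        pk, pl = positions[k], positions[l]
        m = min(len(pk), len(pl))
        if m == 0:
            covariances.append(0.0)
            continue
        s = sum((pk[i] - mu[k]) * (pl[i] - mu[l]) for i in range(m))
        covariances.append(s / (m * n))

    return counts + means + moments + covariances
```

Calling `natural_vector_18("AACC")` returns a list of length 18 whose first four entries are the nucleotide counts. Because each sequence maps to a fixed-length vector regardless of its length, sequences of very different sizes can be compared without alignment.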
First, the classification results confirm that the six-dimensional covariance vector is necessary to characterize sequences. The 18-dimensional natural vector is then used to examine the similarity between giant viruses and archaea, bacteria and other viruses. The nearest-distance results indicate that giant viruses are closer to bacteria in the distribution of the four nucleotides. The phylogenetic relationships of three representative giant virus families, Mimiviridae, Pandoraviridae and Marseilleviridae, are analyzed. The trees show that ten Mimiviridae sequences cluster with Pandoraviridae, and that Mimiviridae is closer to the root of the tree than Marseilleviridae. The newly developed alignment-free method is fast to compute and provides an effective numerical representation for microbial sequences.
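A nearest-distance comparison of this kind could look roughly like the sketch below: given a query vector and labelled groups of reference vectors, it reports the group containing the closest reference. The abstract does not name the distance metric, so Euclidean distance is an assumption here, and the function name `nearest_group` and the toy vectors in the usage note are hypothetical, not data from the study.

```python
import math


def nearest_group(query, groups):
    """Return (label, distance) for the reference vector closest to `query`.

    `groups` maps a label (e.g. "bacteria") to a list of numeric vectors
    with the same length as `query`.
    ASSUMPTION: Euclidean distance; the source does not specify the metric.
    """
    best_label, best_dist = None, math.inf
    for label, vectors in groups.items():
        for vec in vectors:
            d = math.dist(query, vec)  # Euclidean distance, Python 3.8+
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label, best_dist
```

With made-up two-dimensional vectors, `nearest_group([1.0, 2.0], {"bacteria": [[0.0, 0.0], [1.0, 1.0]], "archaea": [[5.0, 5.0]]})` picks the "bacteria" group at distance 1.0. In practice the vectors would be the 18-dimensional representations of the sequences being compared.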