
An efficient numerical representation of genome sequence: natural vector with covariance component.

Affiliations

Department of Mathematical Sciences, Tsinghua University, Beijing, China.

Beijing Electronic Science and Technology Institute, Beijing, China.

Publication information

PeerJ. 2022 Jun 16;10:e13544. doi: 10.7717/peerj.13544. eCollection 2022.

DOI: 10.7717/peerj.13544
PMID: 35729905
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9206847/
Abstract

BACKGROUND

Characterizing and comparing microbial sequences, including those of archaea, bacteria, viruses and fungi, is important for understanding their evolutionary origins and population relationships. Most existing methods are limited by sequence length and lack generality. This study proposes a general characterization method and uses it to study the classification and phylogeny of existing datasets.

METHODS

We present a new alignment-free method to represent and compare biological sequences. By adding the covariance between each pair of nucleotides, the new 18-dimensional natural vector successfully describes 24,250 genomic sequences and 95,542 DNA barcode sequences. The new numerical representation is used to study the classification and phylogenetic relationships of microbial sequences.
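
The abstract does not give the formulas, so the following is only a minimal sketch of how an 18-dimensional vector of this kind could be assembled. It assumes the usual 12-dimensional natural vector (for each of A, C, G, T: the count, the mean position, and the normalized second central moment of its positions) and appends six pairwise terms. The covariance definition used here (pairing the first min(n_k, n_l) occurrences of two nucleotides and dividing by n_k * n_l) is an assumption for illustration and may differ from the paper's. The function names and the use of Euclidean distance for comparison are likewise illustrative.

```python
import math
from itertools import combinations

ALPHABET = "ACGT"


def natural_vector_with_covariance(seq: str) -> list:
    """Sketch of an 18-dim vector: 4 counts, 4 mean positions,
    4 normalized second central moments, 6 pairwise covariance terms."""
    seq = seq.upper()
    # 1-based positions of each nucleotide; other symbols (e.g. N) are skipped
    positions = {b: [i + 1 for i, c in enumerate(seq) if c == b] for b in ALPHABET}
    n = sum(len(p) for p in positions.values())

    counts = [len(positions[b]) for b in ALPHABET]
    means = {b: (sum(p) / len(p) if p else 0.0) for b, p in positions.items()}

    # Second central moment of positions, normalized by n_k * n
    moments = []
    for b in ALPHABET:
        p = positions[b]
        if p:
            moments.append(sum((s - means[b]) ** 2 for s in p) / (len(p) * n))
        else:
            moments.append(0.0)

    # ASSUMPTION: covariance pairs the i-th occurrence of k with the i-th
    # occurrence of l for i < min(n_k, n_l), normalized by n_k * n_l.
    covariances = []
    for k, l in combinations(ALPHABET, 2):  # AC, AG, AT, CG, CT, GT
        pk, pl = positions[k], positions[l]
        if pk and pl:
            total = sum((a - means[k]) * (b - means[l]) for a, b in zip(pk, pl))
            covariances.append(total / (len(pk) * len(pl)))
        else:
            covariances.append(0.0)

    return counts + [means[b] for b in ALPHABET] + moments + covariances


def vector_distance(seq_a: str, seq_b: str) -> float:
    """Euclidean distance between the two sequence vectors (an assumed metric)."""
    return math.dist(natural_vector_with_covariance(seq_a),
                     natural_vector_with_covariance(seq_b))
```

Because each sequence is reduced to a fixed-length vector in a single pass over its positions, sequences of very different lengths can be compared without alignment, which is consistent with the speed claim in the abstract.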

RESULTS

First, the classification results confirm that the six-dimensional covariance vector is necessary to characterize sequences. The 18-dimensional natural vector is then used to examine the similarity between giant viruses and archaea, bacteria and other viruses. The nearest-distance calculations indicate that giant viruses are closer to bacteria in the distribution of the four nucleotides. The phylogenetic relationships of three representative giant virus families, Mimiviridae, Pandoraviridae and Marseilleviridae, are analyzed. The trees show that ten Mimiviridae sequences cluster with Pandoraviridae, and that Mimiviridae is closer to the root of the tree than Marseilleviridae. The newly developed alignment-free method is fast to compute and provides an effective numerical representation for microbial sequences.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4116/9206847/eff06f1ffb57/peerj-10-13544-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4116/9206847/2a92e56fa53c/peerj-10-13544-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4116/9206847/fe4845d6aff0/peerj-10-13544-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4116/9206847/3c9b406d26df/peerj-10-13544-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4116/9206847/dfb88a1ac1e6/peerj-10-13544-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4116/9206847/47274c5775a7/peerj-10-13544-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4116/9206847/8a550bcc5b04/peerj-10-13544-g007.jpg


