Specificity Analysis of Genome Based on Statistically Identical K-Words With Same Base Combination.

Authors

Seo Hyein, Song Yong-Joon, Cho Kiho, Cho Dong-Ho

Affiliations

School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 300-010, South Korea.

Department of Surgery, University of California, Sacramento, California 95064, USA.

Publication information

IEEE Open J Eng Med Biol. 2020 Jul 14;1:214-219. doi: 10.1109/OJEMB.2020.3009055. eCollection 2020.

DOI:10.1109/OJEMB.2020.3009055

PMID:35402963

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8983152/

Abstract

Individual characteristics are determined by a genome, which consists of a complex combination of bases. This base combination is reflected in the k-word profile, which records how many times each word of k consecutive bases occurs. Analyzing genome-specific statistical properties of the k-word profile is therefore important for understanding the characteristics of a genome. In this paper, we propose a new k-word-based method for analyzing genome-specific properties. We define k-words that contain the same number of each base as statistically identical k-words. Under a statistical prediction, statistically identical k-words are expected to appear at similar frequencies. However, this may not hold in a real genome, because a genome is not a random list of bases. The ratio between the frequencies of two statistically identical k-words can therefore be used to investigate the statistical specificity of the genome reflected in the k-word profile. To find the ratios that best represent genomic characteristics, we calculate for each ratio a reference value that yields the minimum error when data are classified by that ratio alone. Finally, we propose a genetic-algorithm-based search to select a minimal set of ratios useful for classification. The proposed method was applied to full-length sequences of microorganisms for pathogenicity classification. Its classification accuracy was similar to that of conventional methods while using only a few features. The method can be used to investigate genome-specific statistical specificity in the k-word profile, to find important properties of a genome, and to classify genome sequences.
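The first steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`kword_profile`, `composition`, `frequency_ratios`, `best_threshold`) are hypothetical, "statistically identical" is read here as "same count of each of A, C, G and T", k-words that never occur in the sequence are skipped, and the reference value is approximated by a simple scan over midpoints between observed ratio values. The genetic-algorithm search for a minimal ratio set is not covered.

```python
from collections import Counter, defaultdict
from itertools import combinations


def kword_profile(seq, k):
    """Count every overlapping k-word in seq that contains only A, C, G, T."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts


def composition(word):
    """Base composition of a k-word as the tuple (#A, #C, #G, #T)."""
    return tuple(word.count(base) for base in "ACGT")


def frequency_ratios(profile):
    """Ratio count(a) / count(b) for each pair of observed k-words
    that share the same base composition (assumed reading of
    'statistically identical'). Unobserved k-words are skipped."""
    groups = defaultdict(list)
    for word in profile:
        groups[composition(word)].append(word)
    ratios = {}
    for words in groups.values():
        for a, b in combinations(sorted(words), 2):
            ratios[(a, b)] = profile[a] / profile[b]
    return ratios


def best_threshold(values, labels):
    """Scan midpoints between distinct ratio values and return the
    (threshold, misclassified_count) pair with the fewest errors when
    one binary class is assigned to values above the threshold.
    Either orientation of the rule is allowed. Returns None when all
    values are equal, because no split is possible."""
    distinct = sorted(set(values))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        wrong = sum((v > t) != bool(y) for v, y in zip(values, labels))
        wrong = min(wrong, len(values) - wrong)  # flip the rule if better
        if best is None or wrong < best[1]:
            best = (t, wrong)
    return best


# Toy usage: "ACACAC" contains AC three times and CA twice,
# and AC and CA share the composition (1, 1, 0, 0).
profile = kword_profile("ACACAC", 2)
print(frequency_ratios(profile))  # {('AC', 'CA'): 1.5}
```

In a random sequence the ratio for such a pair would be expected to stay close to 1, so a ratio that departs from 1 consistently within one class of genomes is the kind of feature the abstract proposes to select.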

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/08bf/8983152/abf3a48abe4e/cho1-3009055.jpg

Similar articles

Specificity Analysis of Genome Based on Statistically Identical K-Words With Same Base Combination.

IEEE Open J Eng Med Biol. 2020 Jul 14;1:214-219. doi: 10.1109/OJEMB.2020.3009055. eCollection 2020.

Classification of various genomic sequences based on distribution of repeated k-word.

Annu Int Conf IEEE Eng Med Biol Soc. 2017 Jul;2017:3894-3897. doi: 10.1109/EMBC.2017.8037707.

A new alignment free genome comparison algorithm based on statistically estimated feature frequency profile.

Annu Int Conf IEEE Eng Med Biol Soc. 2017 Jul;2017:4265-4268. doi: 10.1109/EMBC.2017.8037798.

Phylogenetic tree construction using trinucleotide usage profile (TUP).

BMC Bioinformatics. 2016 Oct 6;17(Suppl 13):381. doi: 10.1186/s12859-016-1222-3.

Comprehensive Word-Level Classification of Screening Mammography Reports Using a Neural Network Sequence Labeling Approach.

J Digit Imaging. 2019 Oct;32(5):685-692. doi: 10.1007/s10278-018-0141-4.

Using Markov model to improve word normalization algorithm for biological sequence comparison.

Amino Acids. 2012 May;42(5):1867-77. doi: 10.1007/s00726-011-0906-2. Epub 2011 Apr 20.

The word landscape of the non-coding segments of the Arabidopsis thaliana genome.

BMC Genomics. 2009 Oct 8;10:463. doi: 10.1186/1471-2164-10-463.

On avoided words, absent words, and their application to biological sequence analysis.

Algorithms Mol Biol. 2017 Mar 14;12:5. doi: 10.1186/s13015-017-0094-z. eCollection 2017.

[Analysis, identification and correction of some errors of model refseqs appeared in NCBI Human Gene Database by in silico cloning and experimental verification of novel human genes].

Yi Chuan Xue Bao. 2004 May;31(5):431-43.

Probabilistic topic modeling for the analysis and classification of genomic sequences.

BMC Bioinformatics. 2015;16 Suppl 6(Suppl 6):S2. doi: 10.1186/1471-2105-16-S6-S2. Epub 2015 Apr 17.

Cited by

Heuristic Analysis of Genomic Sequence Processing Models for High Efficiency Prediction: A Statistical Perspective.

Curr Genomics. 2022 Nov 18;23(5):299-317. doi: 10.2174/1389202923666220927105311.
