Specificity Analysis of Genome Based on Statistically Identical K-Words With Same Base Combination.

Authors

Seo Hyein, Song Yong-Joon, Cho Kiho, Cho Dong-Ho

Affiliations

School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 300-010, South Korea.

Department of Surgery, University of California, Sacramento, California 95064, USA.

Publication information

IEEE Open J Eng Med Biol. 2020 Jul 14;1:214-219. doi: 10.1109/OJEMB.2020.3009055. eCollection 2020.

DOI:10.1109/OJEMB.2020.3009055
PMID:35402963
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8983152/
Abstract

Individual characteristics are determined by a genome consisting of a complex combination of bases. This base combination is reflected in the k-word profile, which records how many times each word of k consecutive bases occurs. It is therefore important to analyze the genome-specific statistical properties of the k-word profile in order to understand the characteristics of the genome. In this paper, we propose a new k-word-based method to analyze genome-specific properties. We define k-words that contain the same number of each base as statistically identical k-words. Statistical prediction estimates that statistically identical k-words appear at similar frequencies. However, this may not hold in a real genome, because a genome is not a random list of bases. The ratio between the frequencies of two statistically identical k-words can therefore be used to investigate the statistical specificity of the genome reflected in the k-word profile. To find the ratios that best represent genomic characteristics, a reference value is calculated for each ratio that yields the minimum error when data are classified by that ratio alone. Finally, we propose a genetic-algorithm-based search to select a minimal set of ratios useful for classification. The proposed method was applied to full-length sequences of microorganisms for pathogenicity classification. The classification accuracy of the proposed algorithm was similar to that of conventional methods while using only a few features. The proposed method for investigating genome-specific statistical specificity in the k-word profile can be applied to find important properties of the genome and to classify genome sequences.
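
The first steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the pseudocount used to avoid division by zero, and the midpoint threshold search are all assumptions made for illustration, and the genetic-algorithm feature selection is omitted. The sketch counts overlapping k-words, groups words with the same base composition, computes the frequency ratio for each pair in a group, and searches for a single reference value that minimizes the classification error for one ratio.

```python
from collections import Counter
from itertools import combinations


def kword_profile(seq, k):
    """Count every overlapping word of k consecutive bases in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def composition(word):
    """Base composition key: the word's letters sorted, e.g. 'GAT' -> 'AGT'."""
    return "".join(sorted(word))


def identical_pair_ratios(profile, pseudocount=1.0):
    """Frequency ratio for each pair of observed k-words sharing a composition.

    The pseudocount is an assumption of this sketch, not taken from the paper.
    """
    groups = {}
    for word in sorted(profile):
        groups.setdefault(composition(word), []).append(word)
    ratios = {}
    for words in groups.values():
        for a, b in combinations(words, 2):
            ratios[(a, b)] = (profile[a] + pseudocount) / (profile[b] + pseudocount)
    return ratios


def best_threshold(values, labels):
    """Return (threshold, errors) minimizing errors for a one-ratio rule.

    The rule is 'value > threshold => label 1'; the flipped rule is also
    allowed, so the error count is the smaller of the two directions.
    """
    points = sorted(set(values))
    candidates = [(lo + hi) / 2 for lo, hi in zip(points, points[1:])] or points
    best = None
    for t in candidates:
        errors = sum((v > t) != bool(y) for v, y in zip(values, labels))
        errors = min(errors, len(values) - errors)
        if best is None or errors < best[1]:
            best = (t, errors)
    return best


if __name__ == "__main__":
    profile = kword_profile("ATGATGTAG", 3)
    for pair, ratio in identical_pair_ratios(profile).items():
        print(pair, round(ratio, 3))
    # Toy ratio values for four genomes with hypothetical class labels.
    print(best_threshold([0.5, 1.0, 2.0, 4.0], [0, 0, 1, 1]))
```

In a random sequence with fixed base frequencies, every ratio would be expected to lie near 1, so ratios that deviate consistently within one class of genomes are the candidates for classification features.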

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/08bf/8983152/02b2d98323f6/cho4-3009055.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/08bf/8983152/abf3a48abe4e/cho1-3009055.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/08bf/8983152/e8791a42857a/cho2-3009055.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/08bf/8983152/67da45f1a074/cho3-3009055.jpg
