Gene sequence analysis model construction based on k-mer statistics.

Affiliation

School of Mathematics and Statistics, Heze University, Heze, China.

Publication information

PLoS One. 2024 Sep 12;19(9):e0306480. doi: 10.1371/journal.pone.0306480. eCollection 2024.

DOI:10.1371/journal.pone.0306480
PMID:39264950
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11392344/
Abstract

With the rapid development of biotechnology, gene sequencing methods have steadily improved, and the structure of gene sequences has become more complex. Traditional sequence alignment methods struggle to handle such complex sequences. To improve the efficiency of gene sequence analysis, this study selects the D2 family of k-mer statistics to build a model for gene sequence alignment analysis. According to the structure of the foreground sequence, the sequence to be aligned is cut at different lengths and divided into multiple subsequences. The maximum dissimilarity among the alignment results for the selected subsequences is then taken as the statistical result. The study also designs an application system that runs the model for sequence alignment analysis. In the experiments, the statistical power of the model was directly proportional to the sequence coverage and cutting length, and inversely proportional to the k value and module length. When the model was deployed in the designed system, the maximum storage capacity of the system was 71 GB, the maximum disk capacity was 135 GB, and the running time was less than 2.0 s. The proposed k-mer statistical sequence alignment model and system therefore have considerable application value in gene alignment analysis.
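The abstract gives no formulas, so the following is only a minimal sketch of the general idea, not the authors' model. It computes the classic D2 statistic (the inner product of two k-mer count vectors), turns it into a cosine-style dissimilarity, and reports the largest dissimilarity over subsequences cut at several lengths. The function names, the cosine normalisation, and the non-overlapping cutting scheme are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq (empty if seq is shorter than k)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(x, y, k):
    """Classic D2 statistic: inner product of the two k-mer count vectors."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    return sum(n * cy[w] for w, n in cx.items())


def d2_dissimilarity(x, y, k):
    """Cosine-style dissimilarity in [0, 1] built from D2.

    Assumption: this normalisation is illustrative only; the paper may
    use a different member of the D2 family.
    """
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    norm = sqrt(sum(n * n for n in cx.values()) * sum(n * n for n in cy.values()))
    if norm == 0:
        return 1.0  # no k-mers to compare: treat as maximally dissimilar
    dot = sum(n * cy[w] for w, n in cx.items())
    return 1.0 - dot / norm


def max_window_dissimilarity(query, target, k, cut_lengths):
    """Cut target into non-overlapping pieces of each length in cut_lengths
    and return the largest dissimilarity between query and any piece."""
    best = 0.0
    for length in cut_lengths:
        for start in range(0, len(target) - length + 1, length):
            piece = target[start:start + length]
            best = max(best, d2_dissimilarity(query, piece, k))
    return best
```

For example, `max_window_dissimilarity("ACGT", "ACGTTTTT", 2, [4])` compares the query with the pieces `ACGT` and `TTTT`; the second piece shares no 2-mers with the query, so it determines the maximum.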

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c5d0/11392344/e1949d912f86/pone.0306480.g012.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c5d0/11392344/c59ceedfe7e9/pone.0306480.g009.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c5d0/11392344/145b820ada80/pone.0306480.g010.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c5d0/11392344/2cef1b831dfb/pone.0306480.g011.jpg
