CFSP: a collaborative frequent sequence pattern discovery algorithm for nucleic acid sequence classification.

Author information

Peng He

Affiliation

School of Information Science and Engineering, Xiamen University, Xiamen, Fujian, China.

Publication information

PeerJ. 2020 Apr 20;8:e8965. doi: 10.7717/peerj.8965. eCollection 2020.

DOI:10.7717/peerj.8965
PMID:32341900
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7179567/
Abstract

BACKGROUND

Conserved nucleic acid sequences play an essential role in transcriptional regulation. The motifs/templates derived from nucleic acid sequence datasets are usually used as biomarkers to predict biochemical properties such as protein binding sites or to identify specific non-coding RNAs. In many cases, template-based nucleic acid sequence classification performs better than some feature extraction methods, such as N-gram and k-spaced pairs classification. The availability of large-scale experimental data provides an unprecedented opportunity to improve motif extraction methods. The process for pattern extraction from large-scale data is crucial for the creation of predictive models.
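
For context, the two baseline descriptors named above can be sketched in a few lines. This is a minimal illustration, not code from the CFSP repository: the function names `ngram_counts` and `k_spaced_pair_counts` are hypothetical, and "k-spaced pair" is assumed here to mean two nucleotides separated by exactly k intervening positions.

```python
from collections import Counter


def ngram_counts(seq, n):
    """Count every contiguous n-gram (k-mer) in seq."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def k_spaced_pair_counts(seq, k):
    """Count nucleotide pairs separated by exactly k intervening positions.

    With k = 0 this reduces to counting adjacent pairs (2-grams).
    """
    return Counter(seq[i] + seq[i + k + 1] for i in range(len(seq) - k - 1))


# Small example on a toy sequence (illustrative data only).
print(dict(ngram_counts("ACGAC", 2)))          # 2-grams: AC, CG, GA, AC
print(dict(k_spaced_pair_counts("ACGAC", 1)))  # pairs one position apart
```

Both descriptors look only at fixed, short windows, which is the limitation that template-based methods try to overcome.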

METHODS

This article proposes CFSP, a Teiresias-like feature extraction algorithm that discovers frequent sub-sequences. Although some motif discovery algorithms allow gaps, they limit the distance and number of gaps. The proposed algorithm can find frequent sequence pairs separated by a larger gap. Combinations of frequent sub-sequences within a given long sequence capture long-distance correlations, which imply specific molecular biological properties. Hence, the proposed algorithm aims to discover these combinations. An ordered set of frequent sub-sequences derived from nucleic acid sequences is used as a base frequent sub-sequence array. Mutation information is attached to each sub-sequence array to implement fuzzy matching: each mutation records a single-nucleotide variant or a nucleotide insertion/deletion (indel), encoding a slight difference between the frequent sequences and the matched subsequence of the sequence under investigation.
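
The ideas in this paragraph can be illustrated with a deliberately simplified sketch. It is not the CFSP implementation (which is written in C++ and shell script). It assumes fixed-length k-mers rather than Teiresias-style patterns, defines support as the number of sequences containing a pattern, and handles only substitutions, not indels. All function names and the toy data are hypothetical.

```python
from collections import Counter


def frequent_kmers(seqs, k, min_support):
    """Return k-mers that occur in at least min_support sequences."""
    support = Counter()
    for s in seqs:
        # A set, so each sequence contributes at most once per k-mer.
        support.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return {kmer for kmer, count in support.items() if count >= min_support}


def frequent_ordered_pairs(seqs, kmers, min_support):
    """Return ordered pairs (a, b) where b occurs after a, with a gap of
    any length, in at least min_support sequences."""
    support = Counter()
    for s in seqs:
        found = set()
        for a in kmers:
            start = s.find(a)  # earliest occurrence leaves the longest tail
            if start < 0:
                continue
            tail = s[start + len(a):]
            found.update((a, b) for b in kmers if b in tail)
        support.update(found)
    return {pair for pair, count in support.items() if count >= min_support}


def fuzzy_find(pattern, seq, max_mismatches=1):
    """Find the first window of seq matching pattern with at most
    max_mismatches substitutions.

    Returns (position, mutations), where each mutation is a tuple
    (offset, pattern_base, sequence_base), or None if nothing matches.
    """
    m = len(pattern)
    for pos in range(len(seq) - m + 1):
        window = seq[pos:pos + m]
        mutations = [(i, p, c)
                     for i, (p, c) in enumerate(zip(pattern, window))
                     if p != c]
        if len(mutations) <= max_mismatches:
            return pos, mutations
    return None


# Toy data: three sequences share ACG ... GCA with gaps of different lengths.
seqs = ["ACGTTTTTGCA", "ACGAAGCA", "ACGCCCCCCGCA", "TTTTTTTT"]
kmers = frequent_kmers(seqs, k=3, min_support=3)
pairs = frequent_ordered_pairs(seqs, kmers, min_support=3)
print(sorted(kmers))                 # ['ACG', 'GCA']
print(sorted(pairs))                 # [('ACG', 'GCA')]
print(fuzzy_find("ACG", "TTAGGTT"))  # (2, [(1, 'C', 'G')])
```

The pair `('ACG', 'GCA')` is found even though the gap length differs in every sequence, which is the long-distance correlation a fixed k-spaced descriptor would miss. The mutation list returned by `fuzzy_find` plays the role of the attached mutation record in this simplified setting.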

CONCLUSIONS

The proposed algorithm has been validated with several nucleic acid sequence prediction case studies. On experimental data sets such as miRNA, piRNA, and Sigma 54 promoters, it achieves better results than recently available feature-descriptor-based methods. CFSP is implemented in C++ and shell script; the source code and related data are available at https://github.com/HePeng2016/CFSP.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2d99/7179567/c0e00aceb42f/peerj-08-8965-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2d99/7179567/5f6a0688fb00/peerj-08-8965-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2d99/7179567/b111893e0653/peerj-08-8965-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2d99/7179567/d52fef9455ac/peerj-08-8965-g003.jpg
