RNA search with decision trees and partial covariance models.

Author information

Smith Jennifer A

Affiliation

Electrical and Computer Engineering Department, Boise State University, 1910 University Ave., Boise, ID 83725-2075, USA.

Publication information

IEEE/ACM Trans Comput Biol Bioinform. 2009 Jul-Sep;6(3):517-27. doi: 10.1109/TCBB.2008.120.

DOI: 10.1109/TCBB.2008.120

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3646588/

Abstract

The use of partial covariance models to search for RNA family members in genomic sequence databases is explored. Each partial model is formed from a contiguous subrange of the columns of the overall RNA family multiple alignment. A binary decision-tree framework is presented for choosing the order in which to apply the partial models and the score thresholds on which the decisions are based. The decision trees are chosen to minimize computation time, subject to the constraint that all of the training sequences are passed to the full covariance model for final evaluation. Computational intelligence methods are suggested for selecting the decision tree, since the tree can be quite complex and there is no obvious method for building it in these cases. Experimental results from seven RNA families show execution times of 0.066 to 0.268 relative to using the full covariance model alone. Tests on the full sets of known sequences for each family show that at least 95 percent of these sequences are found for two families and 100 percent for the other five. Since the full covariance model is run on all sequences accepted by the partial-model decision tree, the false alarm rate is at least as low as that of the full model alone.
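The filtering idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the scoring functions below are toy stand-ins for partial and full covariance models, and every name (`Node`, `classify`, `search`, `passes_all_training`, `gc_count`, `starts_with_g`, `full_score`) and every threshold is hypothetical. The sketch only shows the control flow: each tree node scores a candidate window with one cheap model and branches on a threshold, only windows that reach an accepting leaf are scored by the expensive full model, and a candidate tree is valid only if every training sequence reaches an accepting leaf.

```python
from dataclasses import dataclass
from typing import Callable, List, Union

# A scorer maps a sequence window to a score. In a real system this would be
# a (partial or full) covariance model alignment score; here it is any function.
Scorer = Callable[[str], float]

ACCEPT = "accept"  # leaf: pass the window on to the full model
REJECT = "reject"  # leaf: discard the window


@dataclass
class Node:
    scorer: Scorer                # cheap partial-model stand-in
    threshold: float              # decision threshold for this node
    low: Union["Node", str]       # branch taken when score < threshold
    high: Union["Node", str]      # branch taken when score >= threshold


def classify(tree: Union[Node, str], seq: str) -> str:
    """Walk the binary decision tree until a leaf (ACCEPT or REJECT) is reached."""
    node = tree
    while isinstance(node, Node):
        node = node.high if node.scorer(seq) >= node.threshold else node.low
    return node


def passes_all_training(tree: Union[Node, str], training: List[str]) -> bool:
    """Constraint from the abstract: every training sequence must reach the full model."""
    return all(classify(tree, s) == ACCEPT for s in training)


def search(tree: Union[Node, str], full_model: Scorer,
           full_threshold: float, windows: List[str]) -> List[str]:
    """Run the full model only on windows the tree accepts; it makes the final call."""
    return [w for w in windows
            if classify(tree, w) == ACCEPT and full_model(w) >= full_threshold]


# Toy scorers (assumptions for illustration only, not covariance models).
def gc_count(seq: str) -> float:
    return float(seq.count("G") + seq.count("C"))


def starts_with_g(seq: str) -> float:
    return 1.0 if seq.startswith("G") else 0.0


def full_score(seq: str) -> float:
    return gc_count(seq) + starts_with_g(seq)


# A two-level tree: reject early on low GC content, then check the prefix.
tree = Node(gc_count, 3.0, low=REJECT,
            high=Node(starts_with_g, 1.0, low=REJECT, high=ACCEPT))

training = ["GGCCA", "GCGCU"]
windows = ["GGCCA", "AAAUU", "CGCGA", "GCGCU"]

assert passes_all_training(tree, training)
print(search(tree, full_score, 5.0, windows))
```

Because the full model still scores every accepted window, the tree can only remove candidates, never add hits, which is why the abstract can state that the false alarm rate is no worse than that of the full model alone. Choosing the node order and thresholds to minimize total scoring cost is the optimization problem the paper addresses and is not attempted in this sketch.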
