Smith Jennifer A
Electrical and Computer Engineering Department, Boise State University, 1910 University Ave., Boise, ID 83725-2075, USA.
IEEE/ACM Trans Comput Biol Bioinform. 2009 Jul-Sep;6(3):517-27. doi: 10.1109/TCBB.2008.120.
The use of partial covariance models to search for RNA family members in genomic sequence databases is explored. The partial models are formed from contiguous subranges of the columns of the overall RNA family multiple alignment. A binary decision-tree framework is presented for choosing the order in which to apply the partial models and the score thresholds on which to base the decisions. The decision trees are chosen to minimize computation time subject to the constraint that all of the training sequences are passed to the full covariance model for final evaluation. Computational intelligence methods are suggested for selecting the decision tree, since the tree can be quite complex and there is no obvious method to build it in these cases. Experimental results from seven RNA families show execution times of 0.066 to 0.268 times that of using the full covariance model alone. Tests on the full sets of known sequences for each family show that at least 95 percent of these sequences are found for two families and 100 percent for the other five. Since the full covariance model is run on all sequences accepted by the partial model decision tree, the false alarm rate is at least as low as that of the full model alone.
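The filtering scheme described in the abstract can be illustrated with a minimal sketch: a binary decision tree whose internal nodes each apply one cheap partial-model score and compare it to a threshold, and whose leaves either discard a sequence window or pass it on to the full model. Everything named here is an assumption made for illustration and does not come from the paper: the `Node` and `Leaf` types, the `prefilter` and `search` functions, the threshold values, and above all the scoring functions, which are toy G/C counters standing in for real covariance-model scoring. How the tree itself is selected (the computational intelligence step) is not shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Union

# A scorer maps a sequence window to a score. In the paper's setting this
# would be a (partial or full) covariance model; here it is a toy stand-in.
Scorer = Callable[[str], float]


@dataclass
class Leaf:
    accept: bool  # True: pass the window to the full model; False: discard it


@dataclass
class Node:
    scorer: Scorer                 # hypothetical partial-model scorer
    threshold: float               # hypothetical score threshold for this node
    if_pass: Union["Node", Leaf]   # branch taken when score >= threshold
    if_fail: Union["Node", Leaf]   # branch taken when score < threshold


def prefilter(tree: Union[Node, Leaf], window: str) -> bool:
    """Walk the decision tree and report whether the window reaches an accept leaf."""
    node = tree
    while isinstance(node, Node):
        node = node.if_pass if node.scorer(window) >= node.threshold else node.if_fail
    return node.accept


def search(windows: List[str], tree: Union[Node, Leaf],
           full_scorer: Scorer, full_threshold: float) -> List[str]:
    """Run the expensive full scorer only on windows accepted by the tree."""
    return [w for w in windows
            if prefilter(tree, w) and full_scorer(w) >= full_threshold]


# --- Toy stand-ins (NOT covariance models): count G/C in parts of the window.
def gc_count(s: str) -> float:
    return float(sum(c in "GC" for c in s))


def left_score(w: str) -> float:   # "partial model" over the first 4 columns
    return gc_count(w[:4])


def right_score(w: str) -> float:  # "partial model" over the remaining columns
    return gc_count(w[4:])


def full_score(w: str) -> float:   # "full model" over all columns
    return gc_count(w)


# Apply the left partial scorer first; only if it passes, apply the right one.
tree = Node(left_score, 2.0,
            if_pass=Node(right_score, 2.0, Leaf(True), Leaf(False)),
            if_fail=Leaf(False))

windows = ["GGCCGGCC", "AAAAGGCC", "GGCCAAAA", "GCATGCAT"]
hits = search(windows, tree, full_score, full_threshold=6.0)
```

Because every reported hit has also passed the full scorer, the prefilter in this sketch can only remove candidates, never add them, which mirrors the abstract's remark that the false alarm rate is no higher than that of the full model alone.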