Learning scoring schemes for sequence alignment from partial examples.

Authors

Kim Eagu, Kececioglu John

Affiliation

Department of Computer Science, The University of Arizona, Tucson, AZ 85721, USA.

Publication information

IEEE/ACM Trans Comput Biol Bioinform. 2008 Oct-Dec;5(4):546-56. doi: 10.1109/TCBB.2008.57.

DOI:10.1109/TCBB.2008.57

PMID:18989042

Abstract

When aligning biological sequences, the choice of parameter values for the alignment scoring function is critical. Small changes in gap penalties, for example, can yield radically different alignments. A rigorous way to compute parameter values that are appropriate for aligning biological sequences is through inverse parametric sequence alignment. Given a collection of examples of biologically correct alignments, this is the problem of finding parameter values that make the scores of the example alignments close to those of optimal alignments for their sequences. We extend prior work on inverse parametric alignment to partial examples, which contain regions where the alignment is left unspecified, and to an improved formulation based on minimizing the average error between the score of an example and the score of an optimal alignment. Experiments on benchmark biological alignments show we can find parameters that generalize across protein families and that boost the accuracy of multiple sequence alignment by as much as 25 percent.
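
To make the quantity in the abstract concrete, here is a minimal sketch of how one might measure the gap between an example alignment's score and an optimal alignment's score for a fixed parameter choice. It is not the authors' formulation: it assumes a cost-minimizing global alignment with only two parameters (a `mismatch` cost and a per-column `gap` cost), complete rather than partial examples, and an absolute rather than relative error. All function names are hypothetical.

```python
from typing import List, Tuple


def optimal_cost(a: str, b: str, mismatch: float, gap: float) -> float:
    """Minimum global alignment cost of a and b (dynamic programming).

    Assumed simplified scoring: matches cost 0, substitutions cost
    `mismatch`, and each gap column costs `gap`.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0.0] * len(b)
        for j in range(1, len(b) + 1):
            sub = prev[j - 1] + (0.0 if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = min(sub, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[len(b)]


def example_cost(row_a: str, row_b: str, mismatch: float, gap: float) -> float:
    """Cost of a given alignment, written as two equal-length gapped rows."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    cost = 0.0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            cost += gap
        elif x != y:
            cost += mismatch
    return cost


def average_error(
    examples: List[Tuple[str, str]], mismatch: float, gap: float
) -> float:
    """Mean of (example cost - optimal cost) over all example alignments.

    Each term is non-negative, because the optimal alignment has the
    minimum cost for its pair of sequences.
    """
    errors = []
    for row_a, row_b in examples:
        a = row_a.replace("-", "")
        b = row_b.replace("-", "")
        errors.append(
            example_cost(row_a, row_b, mismatch, gap)
            - optimal_cost(a, b, mismatch, gap)
        )
    return sum(errors) / len(errors)


if __name__ == "__main__":
    examples = [("AC-GT", "ACCGT"), ("A-CG", "AC-G")]
    print(average_error(examples, mismatch=1.0, gap=1.0))
```

An inverse-alignment method would search for parameter values that make this kind of error small; the sketch only evaluates it for given values. Note that in this toy version setting both parameters to zero trivially yields zero error, so a real formulation has to exclude such degenerate solutions.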
