DeBlasio Dan, Kececioglu John
Department of Computer Science, The University of Arizona, Tucson, AZ 85721 USA.
Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213 USA.
Algorithms Mol Biol. 2017 Apr 19;12:11. doi: 10.1186/s13015-017-0102-3. eCollection 2017.
In a computed protein multiple sequence alignment, the coreness of a column is the fraction of its substitutions that are in so-called core columns of the gold-standard reference alignment of its proteins. In benchmark suites of protein reference alignments, the core columns of the reference alignment are those that can be confidently labeled as correct, usually because all residues in the column are sufficiently close in the spatial superposition of the known three-dimensional structures of the proteins. Typically, the accuracy of a protein multiple sequence alignment that has been computed for a benchmark is measured only with respect to the core columns of the reference alignment. When computing an alignment in practice, however, a reference alignment is not known, so the coreness of its columns can only be predicted.
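The definition of coreness can be made concrete with a small sketch. It assumes a hypothetical representation that is not from the paper: each column is a list with one entry per sequence, holding a residue index or `None` for a gap, and a substitution is read as a pair of residues aligned in the same column. The function name `column_coreness` is likewise illustrative.

```python
from itertools import combinations


def column_coreness(column, core_columns):
    """Fraction of a column's aligned residue pairs that also appear
    together in some core column of the reference alignment.

    Assumed representation (illustrative only): a column is a list with
    one entry per sequence, either a residue index or None for a gap.
    """
    # Collect every residue pair that the reference aligns in a core column.
    core_pairs = set()
    for ref_col in core_columns:
        residues = [(seq, res) for seq, res in enumerate(ref_col) if res is not None]
        core_pairs.update(combinations(residues, 2))

    # Enumerate the residue pairs (substitutions) of the computed column.
    residues = [(seq, res) for seq, res in enumerate(column) if res is not None]
    pairs = list(combinations(residues, 2))
    if not pairs:
        return 0.0  # assumption: a column with no pairs has coreness 0

    return sum(pair in core_pairs for pair in pairs) / len(pairs)


# Reference with two core columns over three sequences.
core = [[0, 0, 0], [1, 1, 1]]
shifted = column_coreness([0, 0, 1], core)   # third sequence is misaligned
perfect = column_coreness([1, 1, 1], core)   # matches a core column exactly
```

In this toy case only one of the three residue pairs in the shifted column is aligned in a core column of the reference, so its coreness is one third, while the column that reproduces a core column has coreness 1.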
We develop for the first time a predictor of column coreness for protein multiple sequence alignments. This allows us to predict which columns of a computed alignment are core, and hence better estimate the alignment's accuracy. Our approach to predicting coreness is similar to nearest-neighbor classification from machine learning, except we transform nearest-neighbor distances into a coreness prediction via a regression function, and we learn an appropriate distance function through a new optimization formulation that solves a large-scale linear programming problem. We apply our coreness predictor to parameter advising, the task of choosing parameter values for an aligner's scoring function to obtain a more accurate alignment of a specific set of sequences. We show that for this task, our predictor strongly outperforms other column-confidence estimators from the literature, and affords a substantial boost in alignment accuracy.
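The general idea of turning a nearest-neighbor distance into a coreness value can be sketched as follows. This is not the authors' predictor: the feature vectors, the weighted absolute-difference distance, the logistic form of the regression function, and the `steepness` and `midpoint` parameters are all assumptions made for illustration, and the weights are supplied by hand here rather than learned by linear programming.

```python
import math


def weighted_distance(x, y, weights):
    """Weighted sum of absolute feature differences (assumed distance form)."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, x, y))


def predict_coreness(query, core_examples, weights, steepness=4.0, midpoint=0.5):
    """Map the distance to the nearest core training example into (0, 1).

    A logistic curve is an assumed stand-in for the regression function:
    small distances give values near 1, large distances give values near 0.
    """
    nearest = min(weighted_distance(query, ex, weights) for ex in core_examples)
    return 1.0 / (1.0 + math.exp(steepness * (nearest - midpoint)))


# Hypothetical two-feature descriptions of columns known to be core.
core_examples = [[0.9, 0.8], [0.7, 0.9]]
weights = [1.0, 0.5]  # hand-picked here; the paper learns its distance function

near = predict_coreness([0.9, 0.8], core_examples, weights)
far = predict_coreness([0.1, 0.0], core_examples, weights)
```

A query that coincides with a core example receives a higher predicted coreness than one that lies far from every core example, which is the behavior a distance-based predictor of this kind is meant to have.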