Protein multiple sequence alignment benchmarking through secondary structure prediction.

Authors

Le Quan, Sievers Fabian, Higgins Desmond G

Affiliation

Conway Institute, UCD School of Medicine and Medical Science, University College Dublin, Belfield, Dublin, Dublin 4, Ireland.

Publication

Bioinformatics. 2017 May 1;33(9):1331-1337. doi: 10.1093/bioinformatics/btw840.

Abstract

MOTIVATION

Multiple sequence alignment (MSA) is commonly used to analyze sets of homologous protein or DNA sequences. This has led to the development of many methods and packages for MSA over the past 30 years. Comparing different methods has been problematic and has relied on gold standard benchmark datasets of 'true' alignments or on MSA simulations. A number of protein benchmark datasets have been produced which rely on a combination of manual alignment and/or automated superposition of protein structures. These are either restricted to very small MSAs with few sequences or require manual alignment, which can be subjective. In both cases, it remains very difficult to properly test MSAs of more than a few dozen sequences. PREFAB and HomFam both rely on using a small subset of sequences of known structure and do not fairly test the quality of a full MSA.

RESULTS

In this paper we describe QuanTest, a fully automated and highly scalable test system for protein MSAs which uses secondary structure prediction accuracy (SSPA) to measure alignment quality. This rests on the assumption that better MSAs will give more accurate secondary structure predictions when we include sequences of known structure. SSPA, however, measures the quality of an entire alignment, not just the accuracy on a handful of selected sequences. It can be scaled to alignments of any size, but here we demonstrate its use on alignments of either 200 or 1000 sequences. This allows the testing of slow, accurate programs as well as faster, less accurate ones. We show that the scores from QuanTest are highly correlated with existing benchmark scores. We also validate the method by comparing a wide range of MSA options and by introducing different levels of misalignment into MSAs and examining the effects on the scores.
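
The abstract does not spell out how SSPA is calculated, so the following is only an illustrative sketch and not QuanTest's actual implementation. It assumes SSPA can be approximated as per-residue, three-state agreement (often called Q3) between a predicted secondary structure string and the known one, averaged over the reference sequences in an alignment. The function names `q3_accuracy` and `mean_sspa` and the example strings are hypothetical.

```python
from typing import Iterable, Tuple


def q3_accuracy(predicted: str, reference: str) -> float:
    """Fraction of residues whose predicted state equals the reference state.

    Both arguments are strings over a three-state alphabet
    (H = helix, E = strand, C = coil); this alphabet is an assumption.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have equal length")
    if not reference:
        raise ValueError("sequences must be non-empty")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


def mean_sspa(pairs: Iterable[Tuple[str, str]]) -> float:
    """Average per-sequence accuracy over (predicted, reference) pairs."""
    scores = [q3_accuracy(pred, ref) for pred, ref in pairs]
    if not scores:
        raise ValueError("at least one (predicted, reference) pair is required")
    return sum(scores) / len(scores)


# Hypothetical predictions for two reference sequences of known structure.
example_pairs = [
    ("HHHHCCEEEE", "HHHHCCEEEC"),  # 9 of 10 states agree
    ("CCEEEECC", "CCEEEECC"),      # all 8 states agree
]
print(round(mean_sspa(example_pairs), 3))
```

Under this assumption, comparing two aligners would mean running the same predictor on each aligner's MSA and comparing the resulting averages; a higher average would be read as a better alignment.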

AVAILABILITY AND IMPLEMENTATION

QuanTest is available from http://www.bioinf.ucd.ie/download/QuanTest.tgz.

CONTACT

quan.le@ucd.ie.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
