Protein multiple sequence alignment benchmarking through secondary structure prediction.

Authors

Le Quan, Sievers Fabian, Higgins Desmond G

Affiliation

Conway Institute, UCD School of Medicine and Medical Science, University College Dublin, Belfield, Dublin, Dublin 4, Ireland.

Publication

Bioinformatics. 2017 May 1;33(9):1331-1337. doi: 10.1093/bioinformatics/btw840.

DOI: 10.1093/bioinformatics/btw840
PMID: 28093407
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5408826/
Abstract

MOTIVATION

Multiple sequence alignment (MSA) is commonly used to analyze sets of homologous protein or DNA sequences. This has led to the development of many methods and packages for MSA over the past 30 years. Comparing different methods has been problematic and has relied on gold standard benchmark datasets of 'true' alignments or on MSA simulations. A number of protein benchmark datasets have been produced which rely on manual alignment, automated superposition of protein structures, or a combination of the two. These are either restricted to very small MSAs with few sequences or require manual alignment, which can be subjective. In both cases, it remains very difficult to properly test MSAs of more than a few dozen sequences. PREFAB and HomFam both rely on a small subset of sequences of known structure and do not fairly test the quality of a full MSA.

RESULTS

In this paper we describe QuanTest, a fully automated and highly scalable test system for protein MSAs which uses secondary structure prediction accuracy (SSPA) to measure alignment quality. This is based on the assumption that better MSAs will give more accurate secondary structure predictions when we include sequences of known structure. SSPA, however, measures the quality of an entire alignment, not just the accuracy on a handful of selected sequences. It can be scaled to alignments of any size, but here we demonstrate its use on alignments of either 200 or 1000 sequences. This allows the testing of slow, accurate programs as well as faster, less accurate ones. We show that the scores from QuanTest are highly correlated with existing benchmark scores. We also validate the method by comparing a wide range of MSA options and by introducing different levels of misalignment into MSAs and examining the effects on the scores.

AVAILABILITY AND IMPLEMENTATION

QuanTest is available from http://www.bioinf.ucd.ie/download/QuanTest.tgz.

CONTACT

quan.le@ucd.ie.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
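
The scoring idea in the RESULTS paragraph can be sketched in a few lines. This is only an illustration, not QuanTest's implementation: it assumes SSPA is the percentage of residues whose predicted three-state label (H = helix, E = strand, C = coil) matches the label derived from the known structure, and that the per-family score is the plain mean over the reference sequences. The function names `sspa` and `mean_sspa` are hypothetical, and the step that turns an MSA into a prediction is left out entirely.

```python
from typing import Iterable, Tuple


def sspa(predicted: str, observed: str) -> float:
    """Percentage of positions where the predicted state equals the observed one.

    Assumption: both strings use one character per residue (e.g. H/E/C)
    and cover the same residues of one reference sequence.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


def mean_sspa(pairs: Iterable[Tuple[str, str]]) -> float:
    """Average SSPA over several (predicted, observed) reference sequences."""
    scores = [sspa(p, o) for p, o in pairs]
    if not scores:
        raise ValueError("at least one reference sequence is required")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy data, not taken from the paper: 8 of 10 positions agree.
    print(sspa("HHHHCCEEEE", "HHHCCCEEEC"))  # 80.0
```

Under these assumptions, an aligner that produces a better MSA should yield predictions with a higher `mean_sspa` on the sequences of known structure, which is the property the abstract relies on.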

Figures

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9015/5408826/cf5eab4210f5/btw840f1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9015/5408826/fff8e23ea820/btw840f2.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9015/5408826/b89ed4b72826/btw840f3.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9015/5408826/7cce8a57bf80/btw840f4.jpg

