Reproducing the manual annotation of multiple sequence alignments using a SVM classifier.

Affiliation

Department of Biochemistry and Molecular Biology, Dalhousie University, Sir Charles Tupper Medical Building, Halifax NS B3H 1X5, Canada.

Publication information

Bioinformatics. 2009 Dec 1;25(23):3093-8. doi: 10.1093/bioinformatics/btp552. Epub 2009 Sep 21.

DOI:10.1093/bioinformatics/btp552

PMID:19770262

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2778337/

Abstract

MOTIVATION

Aligning protein sequences with the best possible accuracy requires sophisticated algorithms. Since the optimal alignment is not guaranteed to be the correct one, it is expected that even the best alignment will contain sites that do not respect the assumption of positional homology. Because formulating rules to identify these sites is difficult, it is common practice to manually remove them. Although considered necessary in some cases, manual editing is time-consuming and not reproducible. We present here an automated editing method based on the classification of 'valid' and 'invalid' sites.

RESULTS

A support vector machine (SVM) classifier is trained to reproduce the decisions made during manual editing with an accuracy of 95.0%. This implies that manual editing can be made reproducible and applied to large-scale analyses. We further demonstrate that it is possible to retrain/extend the training of the classifier by providing examples of multiple sequence alignment (MSA) annotation. Near optimal training can be achieved with only 1000 annotated sites, or roughly three samples of protein sequence alignments.
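As a rough illustration of the idea of classifying alignment sites as 'valid' or 'invalid', the sketch below trains a small linear SVM on hand-labelled columns and then drops the columns it predicts to be invalid. It is not the MANUEL implementation: the two per-column features (gap fraction and normalised residue entropy), the toy training columns and their labels, the subgradient-descent trainer and all function names are assumptions made for this example only.

```python
import math
from collections import Counter

def column_features(column):
    """Hypothetical features for one column ('-' = gap): (gap fraction, normalised entropy)."""
    gap_fraction = column.count("-") / len(column)
    residues = [c for c in column if c != "-"]
    if len(residues) < 2:
        return (gap_fraction, 0.0)
    probs = [k / len(residues) for k in Counter(residues).values()]
    entropy = sum(p * math.log(1 / p) for p in probs)
    return (gap_fraction, entropy / math.log(len(residues)))

def train_linear_svm(X, y, lam=0.001, eta=0.05, epochs=2000):
    """Subgradient descent on the L2-regularised hinge loss; labels are +1 / -1."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [wj * (1 - eta * lam) for wj in w]  # shrink step from the regulariser
            if margin < 1:  # hinge loss is active: move towards this example
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b

def predict(w, b, x):
    """Return +1 ('valid') or -1 ('invalid') for one feature vector."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

def filter_alignment(seqs, w, b):
    """Keep only the columns of an alignment that the classifier labels valid."""
    columns = ["".join(col) for col in zip(*seqs)]
    keep = [i for i, col in enumerate(columns)
            if predict(w, b, column_features(col)) == 1]
    return ["".join(s[i] for i in keep) for s in seqs]

# Toy "manual annotation" (made-up labels): +1 = valid site, -1 = invalid site.
train_columns = ["AAAA", "A--G", "LLLI", "-CDE", "KKRK", "ACDE", "GGGG", "--S-"]
train_labels = [1, -1, 1, -1, 1, -1, 1, -1]
X = [column_features(c) for c in train_columns]
w, b = train_linear_svm(X, train_labels)

# Apply the trained classifier to a small made-up alignment of four sequences.
alignment = ["MTV-D", "M-VQD", "M-VRD", "MSIND"]
print(filter_alignment(alignment, w, b))
```

Retraining on a different set of annotated columns only requires calling train_linear_svm again with the new features and labels, which mirrors the retraining scenario described above in spirit, though not in detail.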

AVAILABILITY

This method is implemented in the software MANUEL, licensed under the GPL. A web-based application for single and batch jobs is available at http://fester.cs.dal.ca/manuel.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
