Department of Biochemistry and Molecular Biology, Dalhousie University, Sir Charles Tupper Medical Building, Halifax NS B3H 1X5, Canada.
Bioinformatics. 2009 Dec 1;25(23):3093-8. doi: 10.1093/bioinformatics/btp552. Epub 2009 Sep 21.
Aligning protein sequences with the best possible accuracy requires sophisticated algorithms. Since the optimal alignment is not guaranteed to be the correct one, even the best alignment is expected to contain sites that do not respect the assumption of positional homology. Because formulating rules to identify these sites is difficult, it is common practice to remove them manually. Although considered necessary in some cases, manual editing is time-consuming and not reproducible. We present here an automated editing method based on the classification of 'valid' and 'invalid' sites.
A support vector machine (SVM) classifier is trained to reproduce the decisions made during manual editing with an accuracy of 95.0%. This implies that manual editing can be made reproducible and applied to large-scale analyses. We further demonstrate that the classifier can be retrained, or its training extended, by providing examples of annotated multiple sequence alignments (MSAs). Near-optimal training can be achieved with only 1000 annotated sites, or roughly three samples of protein sequence alignments.
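The idea of learning an editor's keep/remove decisions per alignment site can be illustrated with a minimal sketch. It is not MANUEL's implementation: the two column features (gap fraction and majority-residue fraction), the toy alignments, the labels, and all function names are assumptions chosen for illustration, and the classifier is a plain linear SVM trained by subgradient descent on the regularized hinge loss rather than whatever kernel and feature set the software actually uses.

```python
from collections import Counter

def column_features(msa, j):
    """Illustrative features for column j: gap fraction and majority-residue fraction."""
    column = [seq[j] for seq in msa]
    residues = [c for c in column if c != "-"]
    gap_fraction = 1.0 - len(residues) / len(column)
    majority = Counter(residues).most_common(1)[0][1] / len(column) if residues else 0.0
    return [gap_fraction, majority]

def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=500):
    """Linear SVM via subgradient descent on the L2-regularized hinge loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wk * xk for wk, xk in zip(w, xi)) + b)
            w = [wk * (1.0 - eta * lam) for wk in w]  # shrink from the L2 penalty
            if margin < 1.0:  # hinge loss is active for this site
                w = [wk + eta * yi * xk for wk, xk in zip(w, xi)]
                b += eta * yi
    return w, b

def is_valid(model, x):
    """A site is predicted 'valid' when it falls on the positive side of the hyperplane."""
    w, b = model
    return sum(wk * xk for wk, xk in zip(w, x)) + b > 0.0

def edit_alignment(model, msa):
    """Keep only the columns that the classifier predicts as 'valid'."""
    keep = [j for j in range(len(msa[0])) if is_valid(model, column_features(msa, j))]
    return ["".join(seq[j] for j in keep) for seq in msa]

# Hypothetical manually annotated alignment: +1 = site kept by the editor, -1 = removed.
annotated = ["MKV-LA", "MKVG-A", "MRI--A", "MKV--S"]
labels = [1, 1, 1, -1, -1, 1]
X = [column_features(annotated, j) for j in range(len(annotated[0]))]
model = train_linear_svm(X, labels)

# Apply the learned decisions to a new, unannotated alignment.
new_msa = ["MK-V", "MK-V", "MR-I", "MKAV"]
edited = edit_alignment(model, new_msa)
print(edited)
```

Retraining or extending the classifier, in this sketch, amounts to appending the feature vectors and labels of further annotated alignments to `X` and `labels` and calling `train_linear_svm` again.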
This method is implemented in the software MANUEL, licensed under the GPL. A web-based application for single and batch jobs is available at http://fester.cs.dal.ca/manuel.
Supplementary data are available at Bioinformatics online.