Virtual Laboratory of Biomolecular Structures, Marine Science Institute, College of Science, University of the Philippines Diliman, Quezon City 1101, Philippines.
BMC Bioinformatics. 2010 Feb 8;11:79. doi: 10.1186/1471-2105-11-79.
All polypeptide backbones have the potential to form amyloid fibrils, which are associated with a number of degenerative disorders. However, the likelihood that amyloidosis would actually occur under physiological conditions depends largely on the amino acid composition of a protein. We explore using a naive Bayesian classifier and a weighted decision tree for predicting the amyloidogenicity of immunoglobulin sequences.
The average accuracy based on leave-one-out (LOO) cross-validation of a Bayesian classifier generated from 143 amyloidogenic sequences is 60.84%. This is consistent with the average accuracy of 61.15% for a holdout test set comprising 103 amyloidogenic (AM) and 28 non-amyloidogenic sequences. The LOO cross-validation accuracy increases to 81.08% when the training set is augmented by the holdout test set. In comparison, the average classification accuracy for the holdout test set obtained using a decision tree is 78.64%. Using the Bayesian classifier, non-amyloidogenic sequences are predicted with average LOO cross-validation accuracies between 74.05% and 77.24%, depending on the training set size. The corresponding accuracy for the holdout test set is 89%. For the decision tree, the non-amyloidogenic prediction accuracy is 75.00%.
This exploratory study indicates that both classification methods are promising for providing straightforward predictions of the amyloidogenicity of a sequence. Nevertheless, the number of available sequences that satisfy the premises of this study is limited, and is consequently smaller than the ideal training set size. Increasing the size of the training set clearly increases the accuracy, and expanding the training set to include not only more derivatives but also more alignments would make the method more sound. The accuracy of the classifiers may also improve when additional factors, such as structural and physico-chemical data, are considered. The development of this type of classifier has significant applications in evaluating engineered antibodies, and may be adapted for evaluating engineered proteins in general.
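To make the evaluation procedure concrete, the following is a minimal sketch of a naive Bayesian classifier scored by leave-one-out cross-validation. It is not the authors' implementation: the abstract does not specify the features used, so the sketch assumes a simple amino acid composition model with Laplace smoothing. The function names (`train`, `predict`, `loo_accuracy`) and the toy sequences are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def train(sequences, labels):
    """Fit a multinomial naive Bayes model on residue counts.

    Assumption: features are raw amino acid counts; Laplace (add-one)
    smoothing avoids zero probabilities for unseen residues.
    """
    model = {}
    for label in set(labels):
        members = [s for s, y in zip(sequences, labels) if y == label]
        counts = Counter("".join(members))
        total = sum(counts[a] for a in AMINO_ACIDS)
        model[label] = {
            "log_prior": math.log(len(members) / len(sequences)),
            "log_like": {
                a: math.log((counts[a] + 1) / (total + len(AMINO_ACIDS)))
                for a in AMINO_ACIDS
            },
        }
    return model


def predict(model, sequence):
    """Return the label with the highest posterior log score."""
    def score(label):
        params = model[label]
        # Residues outside the 20 standard amino acids are ignored.
        return params["log_prior"] + sum(
            params["log_like"][a] for a in sequence if a in params["log_like"]
        )
    # sorted() makes tie-breaking deterministic.
    return max(sorted(model), key=score)


def loo_accuracy(sequences, labels):
    """Hold out each sequence once, train on the rest, and score the guess."""
    correct = 0
    for i in range(len(sequences)):
        train_x = sequences[:i] + sequences[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        correct += predict(model, sequences[i]) == labels[i]
    return correct / len(sequences)


# Hypothetical toy data, NOT real immunoglobulin sequences.
toy_sequences = ["VVVIVV", "IVVVII", "VIVIVV", "DEKDEK", "KEDDKE", "EKDKED"]
toy_labels = ["amyloidogenic"] * 3 + ["non-amyloidogenic"] * 3

if __name__ == "__main__":
    print(f"LOO accuracy: {loo_accuracy(toy_sequences, toy_labels):.2%}")
```

With a data set as small as the one described here, leave-one-out is a natural choice because every sequence serves as a test case exactly once while the remaining sequences are used for training.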