Department of Biochemistry and Molecular Biology, Oklahoma State University, Stillwater, OK 74078, USA.
BMC Bioinformatics. 2012;13(Suppl 15):S9. doi: 10.1186/1471-2105-13-S15-S9. Epub 2012 Sep 11.
Members of the phylum Proteobacteria are the most prominent of the bacteria that cause plant diseases, which reduce the quantity and quality of food produced by agriculture. To limit these losses, infections need to be identified at an early stage. Recent developments in next-generation nucleic acid sequencing and mass spectrometry open the door to screening plants by the sequences of their macromolecules. Such an approach requires the ability to recognize the organismal origin of unknown DNA or peptide fragments. There are many ways to approach this problem, but none has emerged as the best protocol. Here we attempt a systematic way to determine the organismal origins of peptides using a machine learning algorithm, the Support Vector Machine (SVM).
The amino acid compositions of proteobacterial proteins were found to differ from those of plant proteins. We developed SVM models based on amino acid and dipeptide compositions to distinguish proteobacterial proteins from plant proteins. The amino acid composition (AAC) based SVM model had an accuracy of 92.44% with a Matthews correlation coefficient (MCC) of 0.85, while the dipeptide composition (DC) based SVM model had a maximum accuracy of 94.67% and an MCC of 0.89. We also developed SVM models based on a hybrid approach (AAC and DC), which gave a maximum accuracy of 94.86% and an MCC of 0.90. The models were tested on datasets not seen during training to assess their validity.
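As an illustration of the features and the metric named above, the following is a minimal sketch, not the authors' implementation. It assumes that AAC is the fraction of each of the 20 standard residues in a sequence, that DC is the fraction of each of the 400 ordered residue pairs, and that the hybrid vector is their concatenation. The function names (`aac`, `dc`, `hybrid`, `mcc`) are hypothetical, and the abstract does not state the normalization, SVM kernel, or parameters used, so SVM training is left out.

```python
from collections import Counter
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def aac(seq):
    """Amino acid composition: fraction of each standard residue (20 values).

    Assumption: non-standard symbols (e.g. X) are ignored.
    """
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    counts = Counter(residues)
    return [counts[a] / len(residues) for a in AMINO_ACIDS]


def dc(seq):
    """Dipeptide composition: fraction of each adjacent residue pair (400 values).

    Assumption: pairs containing a non-standard symbol are skipped.
    """
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    pairs = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    if not pairs:
        return [0.0] * len(DIPEPTIDES)
    counts = Counter(pairs)
    return [counts[d] / len(pairs) for d in DIPEPTIDES]


def hybrid(seq):
    """Hybrid feature vector: AAC followed by DC (420 values)."""
    return aac(seq) + dc(seq)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

Vectors produced by `hybrid` could be passed to any SVM implementation, with one class label for proteobacterial proteins and another for plant proteins; `mcc` would then be computed from the confusion matrix on held-out sequences.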
The results indicate that the SVM based on the AAC and DC hybrid approach can be used to distinguish proteobacterial from plant protein sequences.