Institute of Biomedical Studies.
Department of Computer Science.
Bioinformatics. 2018 May 1;34(9):1481-1487. doi: 10.1093/bioinformatics/btx823.
Classification by supervised machine learning greatly facilitates the annotation of protein characteristics from primary sequence. However, the feature generation step in this process requires detailed knowledge of the attributes used to classify the proteins. Without this knowledge, irrelevant features may be selected, resulting in a faulty model. In this study, we introduce a supervised protein classification method that automates the work-intensive feature generation step with a Natural Language Processing (NLP)-dependent model, using a modified combination of n-grams and skip-grams (m-NGSG).
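To illustrate the general idea of n-gram and skip-gram features on a protein sequence, here is a minimal Python sketch. It is not the m-NGSG implementation: the abstract does not describe the authors' modifications, so the function names (`ngrams`, `skipgrams`, `featurize`), the `_` gap marker, the parameters `max_n` and `max_skip`, and the example sequence are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(seq, n):
    """Return all contiguous substrings of length n (plain n-grams)."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def skipgrams(seq, k):
    """Return residue pairs separated by exactly k skipped positions.

    Skipped positions are written as '_' so that, for example,
    'M_V' (one skipped residue) stays distinct from the 2-gram 'MV'.
    """
    return [f"{seq[i]}{'_' * k}{seq[i + k + 1]}"
            for i in range(len(seq) - k - 1)]


def featurize(seq, max_n=2, max_skip=2):
    """Count n-grams up to max_n and skip-grams up to max_skip.

    Hypothetical helper: the resulting counts could serve as a
    bag-of-features vector for a downstream supervised classifier.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(seq, n))
    for k in range(1, max_skip + 1):
        counts.update(skipgrams(seq, k))
    return counts


# Toy sequence, not taken from any of the study's datasets.
features = featurize("MKVLA")
```

Features of this kind are derived from the sequence alone, with no hand-picked biochemical attributes, which is the property the abstract emphasizes.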
A meta-comparison of cross-validation accuracy on twelve training datasets from nine different published studies shows that m-NGSG is consistently more accurate than contemporary classification and feature generation models. We expect this model to accelerate the classification of proteins from primary sequence data and to make protein characteristic prediction accessible to a broader range of scientists.
m-NGSG is freely available at Bitbucket: https://bitbucket.org/sm_islam/mngsg/src. A web server is available at watson.ecs.baylor.edu/ngsg.
Supplementary data are available at Bioinformatics online.