Protein classification using modified n-grams and skip-grams.

Author information

Institute of Biomedical Studies.

Department of Computer Science.

Publication information

Bioinformatics. 2018 May 1;34(9):1481-1487. doi: 10.1093/bioinformatics/btx823.

DOI:10.1093/bioinformatics/btx823

PMID:29309523

Abstract

MOTIVATION

Classification by supervised machine learning greatly facilitates the annotation of protein characteristics from their primary sequence. However, the feature generation step in this process requires detailed knowledge of attributes used to classify the proteins. Lack of this knowledge risks the selection of irrelevant features, resulting in a faulty model. In this study, we introduce a supervised protein classification method with a novel means of automating the work-intensive feature generation step via a Natural Language Processing (NLP)-dependent model, using a modified combination of n-grams and skip-grams (m-NGSG).
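As an illustration of the kind of features involved, here is a minimal sketch of plain n-gram and skip-gram counting over a protein sequence. The abstract does not describe the modifications that define m-NGSG, so this is not the authors' method; the function names (`ngrams`, `skipgrams`, `featurize`) and the parameters `max_n` and `max_skip` are hypothetical and chosen only for this example.

```python
from collections import Counter


def ngrams(seq, n):
    """Return every contiguous substring of length n in seq."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def skipgrams(seq, k):
    """Return residue pairs with exactly k positions skipped between them.

    The skipped positions are written as underscores, e.g. 'M_V' for k=1.
    """
    gap = "_" * k
    return [seq[i] + gap + seq[i + k + 1] for i in range(len(seq) - k - 1)]


def featurize(seq, max_n=2, max_skip=1):
    """Count all n-grams up to max_n and all skip-grams up to max_skip.

    This is an assumed, simplified feature scheme, not the m-NGSG definition.
    """
    feats = Counter()
    for n in range(1, max_n + 1):
        feats.update(ngrams(seq, n))
    for k in range(1, max_skip + 1):
        feats.update(skipgrams(seq, k))
    return feats


if __name__ == "__main__":
    # Toy sequence for illustration only; not taken from any dataset in the paper.
    print(featurize("MKVL"))
```

The resulting counts could then be turned into a fixed-length vector and passed to any supervised classifier; which classifier and which feature selection the authors use is not stated in the abstract.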

RESULTS

A meta-comparison of cross-validation accuracy with twelve training datasets from nine different published studies demonstrates a consistent increase in accuracy of m-NGSG when compared to contemporary classification and feature generation models. We expect this model to accelerate the classification of proteins from primary sequence data and increase the accessibility of protein characteristic prediction to a broader range of scientists.

AVAILABILITY AND IMPLEMENTATION

m-NGSG is freely available at Bitbucket: https://bitbucket.org/sm_islam/mngsg/src. A web server is available at watson.ecs.baylor.edu/ngsg.

CONTACT

erich_baker@baylor.edu.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
