Using machine learning tools for protein database biocuration assistance.

Affiliations

IDEAI Research Center, Universitat Politècnica de Catalunya, UPC BarcelonaTech, 08034, Barcelona, Spain.

Centro de Investigación Biomédica en Red en Bioingeniería, Biomateriales y Nanomedicina (CIBER-BBN), 08193, Cerdanyola del Vallès, Spain.

Publication information

Sci Rep. 2018 Jul 5;8(1):10148. doi: 10.1038/s41598-018-28330-z.

DOI:10.1038/s41598-018-28330-z

PMID:29977071

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6033909/

Abstract

Biocuration in the omics sciences has become paramount, as research in these fields rapidly evolves towards increasingly data-dependent models. As a result, the management of web-accessible publicly-available databases becomes a central task in biological knowledge dissemination. One relevant challenge for biocurators is the unambiguous identification of biological entities. In this study, we illustrate the adequacy of machine learning methods as biocuration assistance tools using a publicly available protein database as an example. This database contains information on G Protein-Coupled Receptors (GPCRs), which are part of eukaryotic cell membranes and relevant in cell communication as well as major drug targets in pharmacology. These receptors are characterized according to subtype labels. Previous analysis of this database provided evidence that some of the receptor sequences could be affected by a case of label noise, as they appeared to be too consistently misclassified by machine learning methods. Here, we extend our analysis to recent and quite substantially modified new versions of the database and reveal their now extremely accurate labeling using several machine learning models and different transformations of the unaligned sequences. These findings support the adequacy of our proposed method to identify problematic labeling cases as a tool for database biocuration.
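The idea of flagging sequences that are "too consistently misclassified" can be sketched in a few lines. The abstract does not name the models or sequence transformations used, so everything below is an illustrative assumption: amino acid composition stands in for the alignment-free transformation, a nearest-centroid classifier stands in for the machine learning models, and the function names, the toy sequences, and the 0.9 threshold are hypothetical.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq):
    """Alignment-free transformation: relative frequency of the 20 standard residues."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS) or 1
    return [counts[a] / total for a in AMINO_ACIDS]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def predict(train_X, train_y, x):
    """Nearest-centroid prediction (a stand-in for the classifiers in the study)."""
    by_label = {}
    for vec, label in zip(train_X, train_y):
        by_label.setdefault(label, []).append(vec)
    centroids = {label: centroid(vs) for label, vs in by_label.items()}
    return min(centroids, key=lambda label: sq_dist(centroids[label], x))

def misclassification_rates(seqs, labels, n_repeats=20, n_folds=3, seed=0):
    """Fraction of repeated cross-validation runs in which each sequence is misclassified."""
    rng = random.Random(seed)
    X = [aa_composition(s) for s in seqs]
    errors = [0] * len(seqs)
    for _ in range(n_repeats):
        order = list(range(len(seqs)))
        rng.shuffle(order)
        # Each sequence lands in exactly one held-out fold per repeat.
        for test_idx in (order[i::n_folds] for i in range(n_folds)):
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            train_X = [X[i] for i in train_idx]
            train_y = [labels[i] for i in train_idx]
            for i in test_idx:
                if predict(train_X, train_y, X[i]) != labels[i]:
                    errors[i] += 1
    return [e / n_repeats for e in errors]

def flag_suspects(seqs, labels, threshold=0.9, **kwargs):
    """Indices of sequences misclassified in at least `threshold` of the runs."""
    rates = misclassification_rates(seqs, labels, **kwargs)
    return [i for i, r in enumerate(rates) if r >= threshold]

# Toy data (hypothetical): two artificial "subtypes" with distinct compositions.
seqs = [
    "AAAALLLLAAL", "ALALALALAA", "LLLAAALLAA", "AALLAALLAL", "LALALAAALL",
    "KKKEEEKKEE", "KEKEKEKEKK", "EEKKEEKKEK", "KKEEKEKEKE", "EKEKKKEEEK",
    "KEKEKKEEKE",  # looks like subtype "B" but carries label "A"
]
labels = ["A"] * 5 + ["B"] * 5 + ["A"]

print(flag_suspects(seqs, labels))
```

On this toy set, only the deliberately mislabeled sequence at index 10 should be flagged. In a real curation setting the flagged entries would be candidates for expert review, not automatic relabeling, since consistent misclassification can also reflect a genuinely atypical sequence.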
