Closing the loop: from paper to protein annotation using supervised Gene Ontology classification.

Authors

Gobeill Julien, Pasche Emilie, Vishnyakova Dina, Ruch Patrick

Affiliations

BiTeM group, University of Applied Sciences-HEG, Library and Information Sciences, Rte de Drize 7, 1227 Geneva, Switzerland; Division of Medical Information Sciences, University and Hospitals of Geneva, Geneva, Switzerland; and SIBtex group, SIB Swiss Institute of Bioinformatics, Rue Michel-Servet 1, 1206 Geneva, Switzerland.

Publication information

Database (Oxford). 2014 Sep 4;2014. doi: 10.1093/database/bau088. Print 2014.

Abstract

Curating gene function from the literature with Gene Ontology (GO) concepts is a particularly time-consuming task in genomics, and support from bioinformatics is much needed to keep up with the flow of publications. In 2004, the first BioCreative challenge already included a task on automatically assigning GO concepts from full text. At that time, the results were judged to fall far short of the performance required by real curation workflows. In particular, supervised approaches produced the most disappointing results because of a lack of training data. Ten years later, the amount of available curation data has grown massively. In 2013, the BioCreative IV GO task revisited automatic GO assignment. For this task, we investigated the power of our supervised classifier, GOCat. GOCat infers GO concepts by computing similarities between an input text and already curated instances contained in a knowledge base. Subtask A consisted of selecting GO evidence sentences for a relevant gene in a full text. For this, we designed a state-of-the-art supervised statistical approach, using a naïve Bayes classifier and the official training set, and obtained fair results. Subtask B consisted of predicting GO concepts from the output of subtask A. For this, we applied GOCat and achieved leading results, with up to 65% hierarchical recall within the top 20 returned concepts. Contrary to previous competitions, machine learning this time outperformed standard dictionary-based approaches. Thanks to BioCreative IV, we were able to design a complete curation workflow: given a gene name and a full text, the system selects evidence sentences for curation and delivers highly relevant GO concepts. The observed performance is sufficient for use in a real semiautomatic curation workflow. GOCat is available at http://eagl.unige.ch/GOCat/.
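The similarity-based inference described above can be illustrated with a minimal nearest-neighbour sketch. This is not GOCat's actual implementation: the abstract does not specify the similarity measure, the number of neighbours or the voting scheme, so the bag-of-words cosine similarity, the parameter `k`, the similarity-weighted vote and the three-entry knowledge base below are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Toy knowledge base of (curated text, GO concepts) pairs.
# Invented for illustration only; a real knowledge base would hold
# a very large number of curated instances.
KNOWLEDGE_BASE = [
    ("caspase activation triggers apoptosis in neurons", ["GO:0006915"]),
    ("protein induces programmed cell death apoptosis", ["GO:0006915"]),
    ("enzyme repairs damaged dna after uv exposure", ["GO:0006281"]),
]


def tokenize(text):
    """Lowercase the text and return a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_go(text, k=2, top_n=20):
    """Rank GO concepts by similarity-weighted votes of the k nearest instances."""
    query = tokenize(text)
    scored = [(cosine(query, tokenize(doc)), gos) for doc, gos in KNOWLEDGE_BASE]
    # Keep the k most similar curated instances (assumed neighbourhood size).
    neighbours = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    votes = Counter()
    for similarity, gos in neighbours:
        for go in gos:
            votes[go] += similarity
    # Return at most top_n concepts that received a non-zero score.
    return [go for go, score in votes.most_common(top_n) if score > 0]


if __name__ == "__main__":
    print(predict_go("the protein promotes apoptosis of neurons"))
```

The default `top_n=20` mirrors the top-20 cutoff at which the abstract reports hierarchical recall; every other choice here is a placeholder.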

DATABASE URL

http://eagl.unige.ch/GOCat4FT/.
