

Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) Cellular Component curation.

Authors

Van Auken Kimberly, Jaffery Joshua, Chan Juancarlos, Müller Hans-Michael, Sternberg Paul W

Affiliation

Division of Biology, California Institute of Technology, Pasadena, CA 91125, USA.

Publication

BMC Bioinformatics. 2009 Jul 21;10:228. doi: 10.1186/1471-2105-10-228.

Abstract

BACKGROUND

Manual curation of experimental data from the biomedical literature is an expensive and time-consuming endeavor. Nevertheless, most biological knowledge bases still rely heavily on manual curation for data extraction and entry. Text mining software that can semi- or fully automate information retrieval from the literature would thus provide a significant boost to manual curation efforts.

RESULTS

We employ the Textpresso category-based information retrieval and extraction system (http://www.textpresso.org), developed by WormBase, to explore how Textpresso might improve the efficiency with which we manually curate C. elegans proteins to the Gene Ontology's Cellular Component Ontology. Using a training set of sentences that describe results of localization experiments in the published literature, we generated three new curation task-specific categories (Cellular Components, Assay Terms, and Verbs) containing words and phrases associated with reports of experimentally determined subcellular localization. We compared the results of manual curation to those of Textpresso queries that searched the full text of articles for sentences containing terms from each of the three new categories plus the name of a previously uncurated C. elegans protein. Compared with manual curation, Textpresso searches identified curatable papers with recall and precision rates of 79.1% and 61.8%, respectively (F-score of 69.5%). Within those documents, Textpresso identified relevant sentences with recall and precision rates of 30.3% and 80.1% (F-score of 44.0%). From returned sentences, curators were able to make 66.2% of all possible experimentally supported GO Cellular Component annotations with 97.3% precision (F-score of 78.8%). Measuring the relative efficiencies of Textpresso-based versus manual curation, we find that Textpresso has the potential to increase curation efficiency by at least 8-fold, and perhaps as much as 15-fold, given differences in individual curatorial speed.
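The query logic and the F-scores reported above can be illustrated with a minimal sketch. This is not Textpresso's implementation: the category contents, the protein name `ABC-1`, the example sentences, and the whole-word matching rule are all hypothetical stand-ins chosen for illustration. Only the requirement that a sentence contain a protein name plus a term from each of the three categories, and the recall and precision figures, come from the abstract. The F-score is assumed to be the usual balanced harmonic mean of recall and precision.

```python
import re

# Hypothetical, tiny stand-ins for the three curation categories.
# The real Textpresso categories are much larger and are not reproduced here.
CATEGORIES = {
    "cellular_component": {"nucleus", "mitochondria", "plasma membrane"},
    "assay_term": {"gfp", "antibody", "immunostaining"},
    "verb": {"localizes", "localized", "expressed", "detected"},
}


def contains_term(sentence, terms):
    """Return True if any term occurs in the sentence as a whole word or phrase."""
    text = sentence.lower()
    return any(
        re.search(rf"(?<!\w){re.escape(term)}(?!\w)", text) for term in terms
    )


def matches_query(sentence, protein):
    """A sentence matches if it names the protein and hits all three categories."""
    if not contains_term(sentence, {protein.lower()}):
        return False
    return all(contains_term(sentence, terms) for terms in CATEGORIES.values())


def f_score(recall, precision):
    """Balanced F-score (assumed): harmonic mean of recall and precision."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Invented sentences about a hypothetical protein.
    hit = "An ABC-1::GFP fusion localizes to the nucleus."
    miss = "ABC-1 mutants are uncoordinated."
    print(matches_query(hit, "ABC-1"), matches_query(miss, "ABC-1"))

    # Recall and precision pairs as reported in the abstract.
    for label, recall, precision in [
        ("papers", 79.1, 61.8),
        ("sentences", 30.3, 80.1),
        ("annotations", 66.2, 97.3),
    ]:
        print(label, round(f_score(recall, precision), 1))
```

With the rounded inputs, this reproduces the sentence-level and annotation-level F-scores (44.0 and 78.8). The paper-level value comes out at about 69.4 rather than the reported 69.5, a small gap that may reflect the authors computing from unrounded recall and precision.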

CONCLUSION

Textpresso is an effective tool for improving the efficiency of manual, experimentally based curation. Incorporating a Textpresso-based Cellular Component curation pipeline at WormBase has allowed us to transition from strictly manual curation of this data type to a more efficient pipeline of computer-assisted validation. Continued development of curation task-specific Textpresso categories will provide an invaluable resource for genomics databases that rely heavily on manual curation.


