Preliminary evaluation of the CellFinder literature curation pipeline for gene expression in kidney cells and anatomical parts.

Affiliation

Humboldt-Universität zu Berlin, Knowledge Management in Bioinformatics, Berlin, 10099, Germany.

Publication information

Database (Oxford). 2013 Apr 18;2013:bat020. doi: 10.1093/database/bat020. Print 2013.

Abstract

Biomedical literature curation is the process of automatically and/or manually deriving knowledge from scientific publications and recording it in specialized databases for structured delivery to users. It is a slow, error-prone, complex and costly, yet highly important, task. Previous experience has shown that text mining can assist in many of its phases, especially in the triage of relevant documents and the extraction of named entities and biological events. Here, we present the curation pipeline of the CellFinder database, a repository of cell research that includes data derived from literature curation and microarrays to identify cell types, cell lines, organs and so forth, and especially patterns in gene expression. The curation pipeline relies on freely available tools for all text mining steps, followed by manual validation of the extracted data. Preliminary results are presented for a data set of 2376 full texts, from which >4500 gene expression events in cells or anatomical parts have been extracted. Validation of half of these data yielded a precision of ~50%, which indicates that we are on the right track with our pipeline for the proposed task. However, evaluation of the methods shows that there is still room for improvement in named-entity recognition and that a larger and more robust corpus is needed to achieve better performance in event extraction. Database URL: http://www.cellfinder.org/
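
The precision quoted in the abstract is the share of manually validated extractions that curators judged correct. The following is a minimal sketch of that arithmetic, assuming the standard definition precision = TP / (TP + FP). The function name and the counts are hypothetical illustrations, not code or figures from the paper.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Return the fraction of validated extractions judged correct.

    Assumes the standard definition: TP / (TP + FP).
    """
    validated = true_positives + false_positives
    if validated == 0:
        raise ValueError("no validated extractions to score")
    return true_positives / validated


# Toy counts for illustration only; these are NOT the paper's numbers.
accepted_by_curator = 5   # extracted events confirmed as correct
rejected_by_curator = 5   # extracted events marked as wrong

print(f"precision = {precision(accepted_by_curator, rejected_by_curator):.2f}")
```

Precision scored this way covers only the events the pipeline extracted and curators reviewed; it says nothing about events the pipeline missed, which would require a recall estimate against an annotated corpus.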


