Humboldt-Universität zu Berlin, Knowledge Management in Bioinformatics, Berlin, 10099, Germany.
Database (Oxford). 2013 Apr 18;2013:bat020. doi: 10.1093/database/bat020. Print 2013.
Biomedical literature curation is the process of automatically and/or manually deriving knowledge from scientific publications and recording it in specialized databases for structured delivery to users. It is a slow, error-prone, complex and costly task, yet a highly important one. Previous experience has shown that text mining can assist in many of its phases, especially in the triage of relevant documents and the extraction of named entities and biological events. Here, we present the curation pipeline of the CellFinder database, a repository of cell research that includes data derived from literature curation and microarrays to identify cell types, cell lines, organs and so forth, and especially patterns in gene expression. The curation pipeline relies on freely available tools for all text mining steps, followed by manual validation of the extracted data. Preliminary results are presented for a data set of 2376 full texts, from which >4500 gene expression events in cells or anatomical parts have been extracted. Validation of half of these data yielded a precision of ~50%, which indicates that we are on the right track with our pipeline for the proposed task. However, evaluation of the methods shows that there is still room for improvement in named-entity recognition and that a larger and more robust corpus is needed to achieve better performance in event extraction. Database URL: http://www.cellfinder.org/
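The precision reported above is the fraction of validated extracted events that curators judged correct. A minimal sketch of that arithmetic follows, assuming each validated event carries a simple accept/reject verdict. The `ValidatedEvent` record, its field names and the placeholder sample values are hypothetical illustrations, not the CellFinder schema or its data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidatedEvent:
    """Hypothetical record for one manually validated extraction."""
    gene: str
    location: str   # cell type, cell line or anatomical part
    accepted: bool  # curator's verdict on the extracted event


def precision(events):
    """Return accepted events divided by all validated events."""
    if not events:
        raise ValueError("precision is undefined without validated events")
    accepted = sum(1 for event in events if event.accepted)
    return accepted / len(events)


# Placeholder values for illustration only; not results from the paper.
sample = [
    ValidatedEvent("GENE_A", "CELL_TYPE_X", True),
    ValidatedEvent("GENE_B", "ORGAN_Y", False),
    ValidatedEvent("GENE_C", "CELL_LINE_Z", True),
    ValidatedEvent("GENE_D", "ORGAN_Y", False),
]

print(f"precision = {precision(sample):.2f}")
```

Precision computed this way says nothing about recall: events that the pipeline failed to extract never reach the curators and so do not enter the calculation.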