Scholarly Information Extraction Is Going to Make a Quantum Leap with PubMed Central (PMC).

Author information

Matthies Franz, Hahn Udo

Affiliation

Jena University Language & Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Jena 07743, Germany.

Publication information

Stud Health Technol Inform. 2017;245:521-525.

Abstract

With the increasing availability of complete full texts (journal articles), rather than their surrogates (titles, abstracts), as resources for text analytics, entirely new opportunities arise for information extraction and text mining from scholarly publications. Yet, we gathered evidence that a range of problems is encountered in full-text processing when biomedical text analytics simply reuses existing NLP pipelines which were developed on the basis of abstracts (rather than full texts). We conducted experiments with four different relation extraction engines, all of which were top performers in previous BioNLP Event Extraction Challenges. We found that abstract-trained engines lose up to 6.6 percentage points of F-score when run on full-text data. Hence, the reuse of existing abstract-based NLP software in a full-text scenario is considered harmful because of heavy performance losses. Given the current lack of annotated full-text resources to train on, our study quantifies the price paid for this shortcut.
