Corpus refactoring: a feasibility study.

Authors

Johnson Helen L, Baumgartner William A, Krallinger Martin, Cohen K Bretonnel, Hunter Lawrence

Affiliation

Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, CO, USA.

Publication

J Biomed Discov Collab. 2007 Sep 13;2:4. doi: 10.1186/1747-5333-2-4.

DOI: 10.1186/1747-5333-2-4
PMID: 17854502
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2072937/

Abstract

BACKGROUND

Most biomedical corpora have not been used outside of the lab that created them, despite the fact that the availability of the gold-standard evaluation data that they provide is one of the rate-limiting factors for the progress of biomedical text mining. Data suggest that one major factor affecting the use of a corpus outside of its home laboratory is the format in which it is distributed. This paper tests the hypothesis that corpus refactoring (changing the format of a corpus without altering its semantics) is a feasible goal, namely that it can be accomplished with a semi-automatable process and in a time-efficient way. We used simple text processing methods and limited human validation to convert the Protein Design Group corpus into two new formats: WordFreak and embedded XML. We tracked the total time expended and the success rates of the automated steps.
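
As an illustration of what "changing the format without altering the semantics" can look like, the sketch below converts stand-off annotations (character offsets plus a label) into embedded XML and then checks that stripping the tags recovers the original text. This is not the authors' code: the function name `embed_annotations`, the `protein` tag, the example sentence, and the stand-off input layout are all hypothetical, and the actual Protein Design Group corpus format is not described in this abstract.

```python
import re
from xml.sax.saxutils import escape, unescape


def embed_annotations(text, spans):
    """Wrap each (start, end, label) character span of `text` in an XML element.

    Assumptions of this sketch: spans do not overlap and offsets are
    half-open, i.e. the annotated string is text[start:end].
    """
    parts = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        parts.append(escape(text[cursor:start]))          # untagged text before the span
        parts.append(f"<{label}>{escape(text[start:end])}</{label}>")
        cursor = end
    parts.append(escape(text[cursor:]))                   # trailing untagged text
    return "".join(parts)


# Hypothetical example sentence and stand-off annotations.
sentence = "Raf-1 binds to MEK1 & activates it."
standoff = [(0, 5, "protein"), (15, 19, "protein")]

embedded = embed_annotations(sentence, standoff)
print(embedded)

# Semantics check: removing the tags and unescaping gives back the source text.
recovered = unescape(re.sub(r"</?protein>", "", embedded))
print(recovered == sentence)
```

The round-trip check at the end is the kind of automated validation that can reduce the amount of manual review a format conversion needs, although it only verifies the text, not the correctness of the annotations themselves.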

RESULTS

The refactored corpus is available for download at the BioNLP SourceForge website http://bionlp.sourceforge.net. The total time expended was just over three person-weeks, consisting of about 102 hours of programming time (much of which is one-time development cost) and 20 hours of manual validation of automatic outputs. Additionally, the steps required to refactor any corpus are presented.
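
The effort figures can be cross-checked with a few lines of arithmetic. The 40-hour person-week used below is an assumption, not something stated in the abstract, and the 102-hour figure is itself approximate.

```python
programming_hours = 102      # "about 102 hours" of programming, from the abstract
validation_hours = 20        # manual validation of automatic outputs, from the abstract
HOURS_PER_PERSON_WEEK = 40   # assumption: the abstract does not define a person-week

total_hours = programming_hours + validation_hours
person_weeks = total_hours / HOURS_PER_PERSON_WEEK
validation_share = round(100 * validation_hours / total_hours, 1)

print(total_hours)        # 122
print(person_weeks)       # 3.05
print(validation_share)   # 16.4 (percent of total time spent on manual validation)
```

Under that assumption the total comes to 122 hours, or 3.05 person-weeks, which agrees with "just over three person-weeks".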

CONCLUSION

We conclude that refactoring of publicly available corpora is a technically and economically feasible method for increasing the usage of data already available for evaluating biomedical language processing systems.
