Assessment of software testing and quality assurance in natural language processing applications and a linguistically inspired approach to improving it.

Authors

Cohen K Bretonnel, Hunter Lawrence E, Palmer Martha

Affiliations

Computational Bioscience Program, University of Colorado School of Medicine, Aurora, Colorado, USA; Department of Linguistics, University of Colorado at Boulder, Boulder, Colorado, USA.

Publication information

Trust Eternal Syst Via Evol Softw Data Knowl (2012). 2013;379:77-90. doi: 10.1007/978-3-642-45260-4_6.

DOI:10.1007/978-3-642-45260-4_6
PMID:34308448
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8300901/
Abstract

Significant progress has been made in addressing the scientific challenges of biomedical text mining. However, the transition from a demonstration of scientific progress to the production of tools on which a broader community can rely requires that fundamental software engineering requirements be addressed. In this paper we characterize the state of biomedical text mining software with respect to software testing and quality assurance. Biomedical natural language processing software was chosen because it frequently specifically claims to offer production-quality services, rather than just research prototypes. We examined twenty web sites offering a variety of text mining services. On each web site, we performed the most basic software test known to us and classified the results. Seven out of twenty web sites returned either bad results or the worst class of results in response to this simple test. We conclude that biomedical natural language processing tools require greater attention to software quality. We suggest a linguistically motivated approach to granular evaluation of natural language processing applications, and show how it can be used to detect performance errors of several systems and to predict overall performance on specific equivalence classes of inputs. We also assess the ability of linguistically-motivated test suites to provide good software testing, as compared to large corpora of naturally-occurring data. We measure code coverage and find that it is considerably higher when even small structured test suites are utilized than when large corpora are used.
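The idea of a small structured test suite organized by equivalence classes of inputs can be sketched as follows. This is a minimal illustration only, not the authors' test suite or any of the systems they examined: the regex-based `tag_genes` recognizer, the class names, and the test sentences are all hypothetical, and using an empty string as a basic test is an assumption, since the abstract does not say which basic test was applied to the web sites.

```python
import re
from collections import defaultdict

# Hypothetical toy recognizer, NOT one of the systems evaluated in the paper.
# It tags runs of two or more uppercase letters followed by optional digits.
GENE_PATTERN = re.compile(r"\b[A-Z]{2,}[0-9]*\b")


def tag_genes(text):
    """Return the list of candidate gene mentions found in text."""
    return GENE_PATTERN.findall(text)


# Each case: (equivalence class, input text, expected mentions).
# Classes and sentences are invented for illustration.
TEST_SUITE = [
    ("empty_input", "", []),
    ("letters_only", "Mutations in TP were observed.", ["TP"]),
    ("letters_digits", "BRCA1 is a tumor suppressor.", ["BRCA1"]),
    ("letters_digits", "We studied TP53 expression.", ["TP53"]),
    ("hyphenated", "IL-2 levels increased.", ["IL-2"]),
    ("mixed_case", "Shh signalling was reduced.", ["Shh"]),
]


def evaluate(tagger, suite):
    """Return the fraction of passing cases for each equivalence class."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for cls, text, expected in suite:
        totals[cls] += 1
        try:
            ok = tagger(text) == expected
        except Exception:
            # A crash on any input counts as a failure for that class.
            ok = False
        passed[cls] += ok
    return {cls: passed[cls] / totals[cls] for cls in totals}


if __name__ == "__main__":
    for cls, rate in sorted(evaluate(tag_genes, TEST_SUITE).items()):
        print(f"{cls}: {rate:.0%}")
```

Reporting results per class rather than as a single aggregate score makes it visible that this toy recognizer handles names made of letters and digits but misses hyphenated and mixed-case names, which is the kind of granular finding the abstract attributes to linguistically motivated test suites.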

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3499/8300901/aca96869363b/nihms-1641159-f0001.jpg
