Functional evaluation of out-of-the-box text-mining tools for data-mining tasks.

Authors

Jung Kenneth, LePendu Paea, Iyer Srinivasan, Bauer-Mehren Anna, Percha Bethany, Shah Nigam H

Affiliations

Program in Biomedical Informatics, Stanford University, Stanford, California, USA.

Center for Biomedical Informatics Research, Stanford University, Stanford, California, USA.

Publication information

J Am Med Inform Assoc. 2015 Jan;22(1):121-31. doi: 10.1136/amiajnl-2014-002902. Epub 2014 Oct 21.

DOI:10.1136/amiajnl-2014-002902

PMID:25336595

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4433377/

Abstract

OBJECTIVE

The trade-off between the speed and simplicity of dictionary-based term recognition and the richer linguistic information provided by more advanced natural language processing (NLP) is an area of active discussion in clinical informatics. In this paper, we quantify this trade-off by comparing text processing systems that balance speed and linguistic understanding differently. We tested both types of systems in three clinical research tasks: phase IV safety profiling of a drug, learning adverse drug-drug interactions, and learning used-to-treat relationships between drugs and indications.

MATERIALS

We first benchmarked the accuracy of the NCBO Annotator and REVEAL on a manually annotated, publicly available dataset from the 2008 i2b2 Obesity Challenge. We then applied the NCBO Annotator and REVEAL to 9 million clinical notes from the Stanford Translational Research Integrated Database Environment (STRIDE) and used the resulting data for three research tasks.
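Benchmarking against a manually annotated dataset can be sketched as a comparison of two sets of annotations. The abstract does not say which accuracy measures were computed, so the sketch below assumes sensitivity and precision over hypothetical (note, concept) pairs. The function name and the toy data are illustrative and are not taken from the paper.

```python
def sensitivity_and_precision(gold, predicted):
    """Compare system annotations with gold-standard annotations.

    Both arguments are collections of (note_id, concept) pairs.
    This pairing scheme is an assumption made for illustration.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    # Sensitivity (recall): share of gold annotations the system found.
    sensitivity = true_positives / len(gold) if gold else 0.0
    # Precision: share of system annotations that are in the gold standard.
    precision = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, precision


# Hypothetical toy data, not drawn from the i2b2 dataset.
gold = {
    ("note1", "obesity"),
    ("note1", "diabetes"),
    ("note2", "asthma"),
    ("note3", "obesity"),
}
predicted = {
    ("note1", "obesity"),
    ("note2", "asthma"),
    ("note3", "gout"),
}

sens, prec = sensitivity_and_precision(gold, predicted)
print(f"sensitivity={sens:.2f} precision={prec:.2f}")
```

Running the same comparison for each tool on the same gold standard gives directly comparable figures.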

RESULTS

With large datasets, the results of the three research tasks did not differ significantly between the NCBO Annotator and REVEAL. In one subtask, REVEAL achieved higher sensitivity with smaller datasets.

CONCLUSIONS

For a variety of tasks, employing simple term recognition methods instead of advanced NLP methods results in little or no impact on accuracy when using large datasets. Simpler dictionary-based methods have the advantage of scaling well to very large datasets. Promoting the use of simple, dictionary-based methods for population level analyses can advance adoption of NLP in practice.
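As a rough illustration of what simple dictionary-based term recognition means, the sketch below matches a small, hypothetical term dictionary against a note using a regular expression. It is not the NCBO Annotator or REVEAL. The dictionary entries, concept labels, and function names are assumptions made for this example, and the sketch does no negation or context handling.

```python
import re

# Hypothetical toy dictionary mapping surface terms to concept labels.
# Real systems draw terms from large ontologies; these entries are illustrative.
TERM_DICTIONARY = {
    "myocardial infarction": "CONCEPT_MI",
    "heart attack": "CONCEPT_MI",
    "rofecoxib": "CONCEPT_ROFECOXIB",
}


def build_pattern(terms):
    """Compile one case-insensitive pattern that matches any dictionary term."""
    # Longest terms first, so multi-word terms win over shorter overlaps.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)


def recognize_terms(text, dictionary=TERM_DICTIONARY):
    """Return (start_offset, matched_text, concept_label) for each match."""
    pattern = build_pattern(dictionary)
    return [
        (match.start(), match.group(0), dictionary[match.group(0).lower()])
        for match in pattern.finditer(text)
    ]


note = "Patient on rofecoxib presented with a heart attack."
matches = recognize_terms(note)
for start, surface, concept in matches:
    print(start, surface, concept)
```

Because each note is processed independently with one precompiled pattern, this style of matching is easy to run in parallel over millions of notes, which is the scaling advantage the conclusions point to.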

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/afb9/4433377/ac777a8ae51a/ocu921f1p.jpg
