PhenoDEF: a corpus for annotating sentences with information of phenotype definitions in biomedical literature.

Affiliations

Department of Biohealth Informatics, Indiana University School of Informatics and Computing, Indianapolis, IN, USA.

Medical Informatics Unit, Department of Medical Education, College of Medicine, King Saud University, Riyadh, Saudi Arabia.

Publication information

J Biomed Semantics. 2022 Jun 11;13(1):17. doi: 10.1186/s13326-022-00272-6.

DOI:10.1186/s13326-022-00272-6

PMID:35690873

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9188713/

Abstract

BACKGROUND

Adverse events induced by drug-drug interactions are a major concern in the United States. Current research is moving toward using electronic health record (EHR) data, including for adverse drug event discovery. One of the first steps in EHR-based studies is to define a phenotype for establishing a cohort of patients. However, phenotype definitions are not readily available for all phenotypes. One of the first steps in developing automated text mining tools is building a corpus. Therefore, this study aimed to develop annotation guidelines and a gold standard corpus to facilitate building future automated approaches for mining phenotype definitions contained in the literature. We also aimed to improve the understanding of how published phenotype definitions are presented in the literature and how to annotate them for future text mining tasks.

RESULTS

Two annotators manually annotated the corpus at the sentence level for the presence of evidence for phenotype definitions. Three major categories (inclusion, intermediate, and exclusion) with a total of ten dimensions were proposed, characterizing the major contextual patterns and cues for presenting phenotype definitions in published literature. The developed annotation guidelines were used to annotate a corpus of 3971 sentences: 1923 (48.4%) for the inclusion category, 1851 (46.6%) for the intermediate category, and 2273 (57.2%) for the exclusion category. These counts sum to more than 3971, so a sentence can be annotated with more than one category. The dimension with the most annotated sentences was "Biomedical & Procedure", with 1449 of 3971 (36.5%). The dimension with the fewest was "The use of NLP", with 49 of 3971 (1.2%). The overall percent inter-annotator agreement was 97.8%. Percent and Kappa statistics also showed high inter-annotator agreement across all dimensions.
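The two agreement measures named above can be illustrated with a minimal sketch. It assumes Cohen's kappa (the abstract says only "Kappa") and one binary label per sentence for a single dimension. The function names and the toy labels are hypothetical and are not taken from the PhenoDEF corpus.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of sentences on which both annotators gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = percent_agreement(labels_a, labels_b)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label
    # if each labelled independently at their own observed rates.
    p_expected = sum(
        counts_a[label] * counts_b[label]
        for label in set(labels_a) | set(labels_b)
    ) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy data: 1 = sentence tagged with the dimension, 0 = not tagged.
annotator_1 = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
annotator_2 = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

agreement = percent_agreement(annotator_1, annotator_2)
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"percent agreement = {agreement:.3f}, kappa = {kappa:.3f}")
```

Kappa is usually reported alongside raw percent agreement because a rare dimension (such as one tagged in about 1% of sentences) can show very high percent agreement from the untagged sentences alone.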

CONCLUSIONS

The corpus and annotation guidelines can serve as a foundational informatics approach for annotating and mining phenotype definitions in literature, and can be used later for text mining applications.

Similar articles

New directions in biomedical text annotation: definitions, guidelines and corpus construction.

BMC Bioinformatics. 2006 Jul 25;7:356. doi: 10.1186/1471-2105-7-356.

Dense Annotation of Free-Text Critical Care Discharge Summaries from an Indian Hospital and Associated Performance of a Clinical NLP Annotator.

J Med Syst. 2016 Aug;40(8):187. doi: 10.1007/s10916-016-0541-2. Epub 2016 Jun 24.

Syntactic parsing of clinical text: guideline and corpus development with handling ill-formed sentences.

J Am Med Inform Assoc. 2013 Nov-Dec;20(6):1168-77. doi: 10.1136/amiajnl-2013-001810. Epub 2013 Aug 1.

Using text mining techniques to extract phenotypic information from the PhenoCHF corpus.

BMC Med Inform Decis Mak. 2015;15 Suppl 2(Suppl 2):S3. doi: 10.1186/1472-6947-15-S2-S3. Epub 2015 Jun 15.

A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools.

BMC Bioinformatics. 2012 Aug 17;13:207. doi: 10.1186/1471-2105-13-207.

NCBI disease corpus: a resource for disease name recognition and concept normalization.

J Biomed Inform. 2014 Feb;47:1-10. doi: 10.1016/j.jbi.2013.12.006. Epub 2014 Jan 3.

On the creation of a clinical gold standard corpus in Spanish: Mining adverse drug reactions.

J Biomed Inform. 2015 Aug;56:318-32. doi: 10.1016/j.jbi.2015.06.016. Epub 2015 Jun 30.

Building a comprehensive syntactic and semantic corpus of Chinese clinical texts.

J Biomed Inform. 2017 May;69:203-217. doi: 10.1016/j.jbi.2017.04.006. Epub 2017 Apr 9.

A Five-Step Workflow to Manually Annotate Unstructured Data into Training Dataset for Natural Language Processing.

Stud Health Technol Inform. 2024 Jan 25;310:109-113. doi: 10.3233/SHTI230937.

Cited by

Natural language processing for detecting adverse drug events: A systematic review protocol.

NIHR Open Res. 2024 Dec 10;3:67. doi: 10.3310/nihropenres.13504.2. eCollection 2023.

Detection of Patient-Level Immunotherapy-Related Adverse Events (irAEs) from Clinical Narratives of Electronic Health Records: A High-Sensitivity Artificial Intelligence Model.

Pragmat Obs Res. 2024 Dec 20;15:243-252. doi: 10.2147/POR.S468253. eCollection 2024.

Generalizability of machine learning methods in detecting adverse drug events from clinical narratives in electronic medical records.

Front Pharmacol. 2023 Jul 12;14:1218679. doi: 10.3389/fphar.2023.1218679. eCollection 2023.

What Patients Say: Large-Scale Analyses of Replies to the Parkinson's Disease Patient Report of Problems (PD-PROP).

J Parkinsons Dis. 2023;13(5):757-767. doi: 10.3233/JPD-225083.

Correction: PhenoDEF: a corpus for annotating sentences with information of phenotype definitions in biomedical literature.

J Biomed Semantics. 2022 Jul 20;13(1):20. doi: 10.1186/s13326-022-00275-3.
