The RareDis corpus: A corpus annotated with rare diseases, their signs and symptoms.

Affiliations

Tissue Engineering and Regenerative Medicine group, Department of Bioengineering, Universidad Carlos III de Madrid, Avenida de la Universidad, 30, Leganés 28911, Madrid, Spain.

Human Language and Accessibility Technologies, Computer Science Department, Avenida de la Universidad 30, Leganés 28911, Madrid, Spain.

Publication information

J Biomed Inform. 2022 Jan;125:103961. doi: 10.1016/j.jbi.2021.103961. Epub 2021 Dec 5.

DOI: 10.1016/j.jbi.2021.103961
PMID: 34879250

Abstract

Rare diseases each affect a small number of people relative to the general population. However, more than 6,000 different rare diseases exist and, in total, they affect more than 300 million people worldwide. A central problem shared by rare diseases is the delay in diagnosis and the sparse information available to researchers, clinicians, and patients. Obtaining a diagnosis can be a very long and frustrating experience for patients and their families; the average diagnostic delay is 6 to 8 years. Many of these diseases manifest differently from patient to patient, which further hampers their detection and the choice of the correct treatment. Therefore, there is an urgent need to increase scientific and medical knowledge about rare diseases. Natural Language Processing (NLP) can help to extract relevant information about rare diseases to facilitate their diagnosis and treatment, but most NLP techniques require manually annotated corpora. Our goal is therefore to create a gold standard corpus annotated with rare diseases and their clinical manifestations. It could be used to train and test NLP approaches, and the information extracted through NLP could enrich the knowledge of rare diseases and thereby help to reduce the diagnostic delay and improve the treatment of rare diseases. The paper describes the selection of the 1,041 texts included in the corpus, the annotation process, and the annotation guidelines. The entities (disease, rare disease, symptom, sign and anaphor) and the relationships (produces, is a, is acron, is synon, increases risk of, anaphora) were annotated. The RareDis corpus contains annotations of more than 5,000 rare diseases and almost 6,000 clinical manifestations. Moreover, the Inter Annotator Agreement evaluation shows a relatively high agreement (F1-measure of 83.5% under exact match criteria for the entities and 81.3% for the relations). Based on these results, the corpus is of high quality and represents a significant step for the field, since available corpora annotated with rare diseases are scarce. This could open the door to further NLP applications, which would facilitate the diagnosis and treatment of these rare diseases and, therefore, dramatically improve the quality of life of these patients.
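
The agreement figures in the abstract are F1 scores computed under exact match criteria. As a rough illustration of what that means for entities, the sketch below treats one annotator's spans as the reference and the other's as the prediction, and counts a match only when the start offset, end offset, and label are all identical. The `Entity` type, the function name `exact_match_f1`, the label strings, and the sample offsets are assumptions made for this example; the abstract does not describe the authors' actual evaluation procedure.

```python
from __future__ import annotations

from typing import NamedTuple


class Entity(NamedTuple):
    """A hypothetical annotated span: character offsets plus an entity label."""
    start: int
    end: int
    label: str


def exact_match_f1(annotator_a: set[Entity], annotator_b: set[Entity]) -> float:
    """Pairwise F1 between two annotators under exact match.

    Annotator A is treated as the reference and B as the prediction.
    A span counts as agreed only if offsets and label are identical.
    F1 is symmetric, so swapping A and B gives the same score.
    """
    if not annotator_a and not annotator_b:
        return 1.0  # nothing annotated by either side: trivial agreement
    matches = len(annotator_a & annotator_b)
    if matches == 0:
        return 0.0
    precision = matches / len(annotator_b)
    recall = matches / len(annotator_a)
    return 2 * precision * recall / (precision + recall)


# Invented sample data: the annotators agree on two spans and disagree
# on the label of the span at offsets 30-42 (sign vs. symptom).
annotator_a = {
    Entity(0, 15, "RAREDISEASE"),
    Entity(30, 42, "SIGN"),
    Entity(50, 58, "SYMPTOM"),
}
annotator_b = {
    Entity(0, 15, "RAREDISEASE"),
    Entity(30, 42, "SYMPTOM"),
    Entity(50, 58, "SYMPTOM"),
}

score = exact_match_f1(annotator_a, annotator_b)
print(f"Exact-match F1: {score:.3f}")
```

In this invented example, two of the three spans match exactly, so precision and recall are both 2/3 and so is F1. A relaxed criterion that ignored the label, or accepted overlapping offsets, would score the same pair higher, which is why the matching criterion is usually reported next to the agreement figure.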


