The RareDis corpus: A corpus annotated with rare diseases, their signs and symptoms.

Affiliations

Tissue Engineering and Regenerative Medicine group, Department of Bioengineering, Universidad Carlos III de Madrid, Avenida de la Universidad, 30, Leganés 28911, Madrid, Spain.

Human Language and Accessibility Technologies, Computer Science Department, Avenida de la Universidad 30, Leganés 28911, Madrid, Spain.

Publication information

J Biomed Inform. 2022 Jan;125:103961. doi: 10.1016/j.jbi.2021.103961. Epub 2021 Dec 5.

Abstract

Rare diseases affect a small number of people compared to the general population. However, more than 6,000 different rare diseases exist and, in total, they affect more than 300 million people worldwide. The main problems shared by rare diseases are delayed diagnosis and the sparse information available to researchers, clinicians, and patients. Reaching a diagnosis can be a very long and frustrating experience for patients and their families; the average diagnostic delay is 6 to 8 years. Many of these diseases manifest differently from patient to patient, which further hampers their detection and the choice of the correct treatment. Therefore, there is an urgent need to increase scientific and medical knowledge about rare diseases. Natural Language Processing (NLP) can help to extract relevant information about rare diseases to facilitate their diagnosis and treatment, but most NLP techniques require manually annotated corpora. Our goal is therefore to create a gold standard corpus annotated with rare diseases and their clinical manifestations. It could be used to train and test NLP approaches, and the information extracted through NLP could enrich the knowledge of rare diseases and thereby help to reduce the diagnostic delay and improve the treatment of rare diseases. The paper describes the selection of the 1,041 texts included in the corpus, the annotation process, and the annotation guidelines. The entities (disease, rare disease, symptom, sign and anaphor) and the relationships (produces, is a, is acron, is synon, increases risk of, anaphora) were annotated. The RareDis corpus contains annotations of more than 5,000 rare diseases and almost 6,000 clinical manifestations. Moreover, the Inter Annotator Agreement evaluation shows a relatively high agreement (F1-measure of 83.5% under exact match criteria for the entities and 81.3% for the relations). Based on these results, this corpus is of high quality and represents a significant step for the field, since corpora annotated with rare diseases are scarce. This could open the door to further NLP applications, which would facilitate the diagnosis and treatment of these rare diseases and thereby dramatically improve the quality of life of these patients.
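
For readers unfamiliar with how an exact-match F1 agreement score is usually computed, the following is a minimal sketch, not the authors' evaluation script. It assumes each annotation is a (start, end, label) triple and that two annotations agree only when all three fields are identical. The `Entity` type, the `exact_match_f1` function, the label strings, and the character offsets are all hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    """One annotated span: character offsets plus an entity label (hypothetical layout)."""
    start: int
    end: int
    label: str


def exact_match_f1(reference: set[Entity], candidate: set[Entity]) -> float:
    """F1 between two annotators, counting a match only if span and label are identical."""
    if not reference and not candidate:
        return 1.0  # Two empty annotation sets agree trivially.
    matches = len(reference & candidate)
    if matches == 0:
        return 0.0
    precision = matches / len(candidate)
    recall = matches / len(reference)
    return 2 * precision * recall / (precision + recall)


# Invented annotations for one text; the second SIGN span differs by two characters.
annotator_a = {
    Entity(0, 15, "RAREDISEASE"),
    Entity(30, 38, "SIGN"),
    Entity(50, 57, "SYMPTOM"),
}
annotator_b = {
    Entity(0, 15, "RAREDISEASE"),
    Entity(30, 40, "SIGN"),
    Entity(50, 57, "SYMPTOM"),
}

score = exact_match_f1(annotator_a, annotator_b)
print(f"Exact-match F1: {score:.3f}")
```

Under exact match, the SIGN span whose boundary differs by two characters counts as a disagreement, so only two of the three annotations match. A relaxed (overlap-based) criterion would score the same pair higher, which is why the matching criterion is stated alongside the agreement figure.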
