Dataset of anonymized discharge summaries of sepsis patients from a Brazilian tertiary hospital for NLP applications.

Author information

Silva Rildo Pinto da, Pazin-Filho Antonio

Affiliations

Department of Internal Medicine, Ribeirão Preto Medical School, University of São Paulo - av. Bandeirantes, 3.900 - 6o. andar, HC-FMRP, Monte Alegre, Ribeirão Preto, SP.

Publication information

Data Brief. 2025 Jun 18;61:111804. doi: 10.1016/j.dib.2025.111804. eCollection 2025 Aug.

Abstract

The availability of Brazilian Portuguese health record text datasets for Natural Language Processing (NLP) applications is limited, especially for educational purposes. The main reason is data sensitivity, which makes accurate anonymization a prerequisite for release. This article describes a new dataset compiled to help bridge the gap in publicly available information in this area. The data were extracted from discharge summaries in the electronic health record system of a tertiary teaching hospital (Hospital das Clínicas, Ribeirão Preto Medical School). Health records were filtered to identify adult patients diagnosed with sepsis (ICD-10). This diagnosis was chosen because patients with sepsis generally require an extended hospital stay and therefore have long discharge summaries. The data were curated manually by a physician to exclude discharge summaries with incomplete descriptions, resulting in 387 cases. The texts were processed to remove special characters, expand standard medical abbreviations, and anonymize the data. The anonymization process was conducted in two steps: unsupervised anonymization using GLiNER, followed by supervised anonymization using a spaCy model trained by the author to identify named entities. Data for key structured clinical variables (length of hospital stay, number of specialties involved, ICU admission, palliative care status, discharge outcome) were also extracted from the original health records and combined with each summary. A manual medical record review of the selected cases was performed to ensure data quality and the efficacy of anonymization, excluding all records that did not contain relevant medical information. The resulting dataset comprises 200 anonymized Brazilian Portuguese discharge summaries, along with their associated variables, presented in tabular format.
The data provided in this article offer a valuable, practical resource featuring real medical data for teaching and learning basic NLP techniques, such as text preprocessing, named entity recognition, text classification, and topic modeling.
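The text-processing steps named in the abstract (removing special characters, expanding abbreviations, and replacing detected entities) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: it uses only the standard library, the abbreviation table is hypothetical, and the entity spans passed to `mask_entities` are stand-ins for whatever the GLiNER and spaCy detectors would return. The function names are assumptions introduced for illustration.

```python
import re
import unicodedata

# Hypothetical abbreviation table; the abstract does not list the real one.
ABBREVIATIONS = {
    "UTI": "unidade de terapia intensiva",
    "PA": "pressão arterial",
}


def clean_text(text: str) -> str:
    """Normalize Unicode, drop special characters, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    # Keep letters (accented ones included), digits and basic punctuation.
    text = re.sub(r"[^\w\s.,;:()/%-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def expand_abbreviations(text: str, table: dict = ABBREVIATIONS) -> str:
    """Replace whole-word, case-sensitive abbreviations with their expansion."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, table)) + r")\b")
    return pattern.sub(lambda m: table[m.group(1)], text)


def mask_entities(text: str, spans: list) -> str:
    """Replace (start, end, label) character spans with [LABEL] placeholders.

    `spans` stands in for the merged output of the entity detectors.
    Overlapping spans are skipped, keeping the one that starts first.
    """
    parts, cursor = [], 0
    for start, end, label in sorted(spans):
        if start < cursor:
            continue
        parts.append(text[cursor:start])
        parts.append(f"[{label}]")
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)


if __name__ == "__main__":
    raw = "Maria Silva foi admitida na UTI em Ribeirão Preto."
    text = expand_abbreviations(clean_text(raw))
    # Spans are located by string search here only to keep the demo short.
    spans = [
        (text.index("Maria Silva"), text.index("Maria Silva") + 11, "PESSOA"),
        (text.index("Ribeirão Preto"), text.index("Ribeirão Preto") + 14, "LOCAL"),
    ]
    print(mask_entities(text, spans))
```

Masking by character span keeps the replacement step independent of the detector, so spans from two detectors can be merged before a single pass over the text.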

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/37d7/12269971/8ea506e30ade/gr1.jpg