Automated sample annotation for diabetes mellitus in healthcare integrated biobanking.

Author information

Stolp Johannes, Weber Christoph, Ammon Danny, Scherag André, Fischer Claudia, Kloos Christof, Wolf Gunter, Schulze P Christian, Settmacher Utz, Bauer Michael, Stallmach Andreas, Kiehntopf Michael, Betz Boris

Affiliations

Department of Clinical Chemistry and Laboratory Diagnostics and Integrated Biobank Jena (IBBJ), Jena University Hospital - Friedrich Schiller University Jena, Jena, Germany.

Data Integration Center, Jena University Hospital - Friedrich Schiller University Jena, Jena, Germany.

Publication information

Comput Struct Biotechnol J. 2024 Oct 23;24:724-733. doi: 10.1016/j.csbj.2024.10.033. eCollection 2024 Dec.

Abstract

Healthcare integrated biobanking describes the annotation and collection of residual samples from hospitalized patients for research purposes. The central idea of the current work is to establish an automated workflow for sample annotation, selection and storage for diabetes mellitus. This is challenging because data are incomplete at the time of sample selection. The study evaluates a two-step procedure based on machine learning (ML) and natural language processing (NLP) for timely and precise sample annotation for diabetes mellitus. Electronic health record data of 785 persons were extracted from the hospital information system. In the first step, a conditional inference forest (CIF) model was trained and tested on laboratory values from the first 72 h of the hospital stay using test (n = 550) and training (n = 235) data sets. Performance was compared with a simple laboratory cut-off classifier (LCC) and a logistic regression (LR) model. As a second (review) step designed to increase the precision of annotations, algorithms based on laboratory values, ICD-10 codes or information extracted from discharge summaries by a natural language processing software (NLP-DS) were evaluated. For the first step, recall/precision/F1-score/accuracy were 71 %/86 %/0.78/0.82 for CIF and 77 %/70 %/0.74/0.75 for LR, compared to 73 %/68 %/0.70/0.72 for LCC. NLP-DS was the best-performing second (review) step (93 %/100 %/0.97/0.97). Combining first-step models with NLP-DS increased precision to 100 % for all procedures (66 %/100 %/0.80/0.85 for CIF&NLP-DS, 72 %/100 %/0.84/0.87 for LR&NLP-DS and 66 %/100 %/0.80/0.85 for LCC&NLP-DS). NLP-DS removed a larger share of the initially selected samples for LR&NLP-DS and LCC&NLP-DS (removal rates of 35 % and 38 %) than for CIF&NLP-DS (removal rate of 20 %). The developed two-step procedure is an efficient, implementable method for timely and precise annotation of samples from hospitalized patients with diabetes.
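
The reported metrics and the effect of a review step can be illustrated with a minimal Python sketch. It assumes that the combined procedure keeps a sample only when the first-step model selects it and the review step confirms it, which is one plausible reading of the abstract and not necessarily the authors' implementation. The function names (`metrics`, `combine`, `removal_rate`) and the ten-patient toy data are hypothetical and are not taken from the study.

```python
from typing import Dict, List


def metrics(pred: List[bool], truth: List[bool]) -> Dict[str, float]:
    """Recall, precision, F1-score and accuracy for binary annotations."""
    tp = sum(p and t for p, t in zip(pred, truth))
    fp = sum(p and not t for p, t in zip(pred, truth))
    fn = sum(t and not p for p, t in zip(pred, truth))
    tn = sum(not p and not t for p, t in zip(pred, truth))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / len(truth) if truth else 0.0
    return {"recall": recall, "precision": precision,
            "f1": f1, "accuracy": accuracy}


def combine(first_step: List[bool], review: List[bool]) -> List[bool]:
    """Assumed two-step rule: keep a sample only if both steps agree."""
    return [a and b for a, b in zip(first_step, review)]


def removal_rate(first_step: List[bool], final: List[bool]) -> float:
    """Share of initially selected samples dropped by the review step."""
    selected = sum(first_step)
    kept = sum(final)
    return (selected - kept) / selected if selected else 0.0


# Hypothetical toy data for ten patients (not study data).
truth = [bool(x) for x in (1, 1, 1, 1, 0, 0, 0, 0, 0, 0)]
first_step = [bool(x) for x in (1, 1, 1, 0, 1, 1, 0, 0, 0, 0)]
review = [bool(x) for x in (1, 1, 0, 1, 0, 0, 0, 0, 0, 0)]

final = combine(first_step, review)
print("first step:", metrics(first_step, truth))
print("two-step:  ", metrics(final, truth))
print("removal rate:", removal_rate(first_step, final))
```

In this toy example the review step removes both false positives, so precision rises to 100 %, while recall drops because one true positive is also removed. The abstract reports the same kind of trade-off for the combined procedures.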

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7ed0/11635603/0c72c92ca511/ga1.jpg
