Stacking ensemble learning models diagnose pulmonary infections using host transcriptome data from metatranscriptomics.

Author information

Zhang Tian, Deng Ying, Wang Wentao, Zhao Zhe, Wu Yiling, Wang Haoqian, Xia Shutao, Liao Weifang, Liao Weijie

Affiliations

Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, People's Republic of China.

Peng Cheng Laboratory, Shenzhen, China.

Publication information

Sci Rep. 2025 Aug 20;15(1):30516. doi: 10.1038/s41598-025-15914-9.

DOI:10.1038/s41598-025-15914-9

PMID:40835893

Abstract

Prompt diagnosis of pulmonary infections of unknown etiology in severely ill patients remains a challenge because rapid and effective diagnostic methods are lacking. Metatranscriptomic sequencing is a powerful approach, but its clinical utility is often limited by slow turnaround. In this study, we performed metatranscriptomic sequencing on bronchoalveolar lavage fluid (BALF) collected from critically ill, severely ill, and ICU patients. Based on the microbial detection results, patients were classified into four types: negative, bacterial infection, viral infection, and fungal infection. To identify host gene expression signatures associated with infection, we screened characteristic genes from the human metatranscriptomic data by comparing 70% of patients with confirmed infection versus no infection. Using these characteristic genes, we constructed classification sub-models with 13 types of machine learning algorithms and integrated the sub-models into stacking-based ensemble models with Lasso regression, yielding diagnostic models that require only a small set of gene expression inputs. Averaged over five-fold cross-validation, the models showed high diagnostic accuracy in distinguishing infection from non-infection (AUC = 0.984), bacterial from non-bacterial infection (AUC = 0.98), and viral from non-viral infection (AUC = 0.98). In the test cohorts, the method was highly consistent with metatranscriptomic sequencing in determining infection status (AUC = 0.865) and infection type (viral: AUC = 0.934; bacterial: AUC = 0.871). Our study presents a rapid and inexpensive adjunctive diagnostic strategy with diagnostic accuracy comparable to metatranscriptomic sequencing, enabling timely identification of both infection status and infection type in pulmonary infections.
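The stacking step described in the abstract can be illustrated with a minimal sketch using scikit-learn. This is not the authors' pipeline. The data are synthetic, the three base learners stand in for the 13 algorithm types (which the abstract does not list), and the 70/30 split, the Lasso penalty `alpha`, and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import (StratifiedKFold, cross_val_predict,
                                     train_test_split)
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a (patients x characteristic genes) matrix.
# Assumption: this is NOT study data; y = 1 means "infected".
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           class_sep=1.5, random_state=0)

# Assumed 70/30 split into a training set and a held-out test cohort.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Three illustrative sub-models (the study used 13 algorithm types).
base_models = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

# Level 0: out-of-fold probabilities from five-fold cross-validation, so the
# meta-learner never sees a prediction made on a sample the sub-model trained on.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
oof = np.column_stack([
    cross_val_predict(model, X_train, y_train, cv=cv,
                      method="predict_proba")[:, 1]
    for model in base_models.values()
])

# Level 1: Lasso regression combines the sub-model outputs. The L1 penalty
# can shrink the weight of a redundant sub-model to zero.
# alpha=0.01 is an arbitrary illustrative value.
meta = Lasso(alpha=0.01).fit(oof, y_train)

# Refit each sub-model on the full training set, then score the test cohort.
test_meta = np.column_stack([
    model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    for model in base_models.values()
])
test_scores = meta.predict(test_meta)
test_auc = roc_auc_score(y_test, test_scores)

print("sub-model weights:", dict(zip(base_models, meta.coef_.round(3))))
print(f"test AUC: {test_auc:.3f}")
```

The Lasso output is a continuous score, not a calibrated probability, which is sufficient for ranking-based metrics such as AUC. A classifier-valued alternative would be an L1-penalized logistic regression as the meta-learner.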
