Machine learning for the prediction of spontaneous preterm birth using early second and third trimester maternal blood gene expression: A cautionary tale.

Authors

Hornaday Kylie K, Werbicki Ty, Tough Suzanne C, Wood Stephen L, Anderson David W, Li Constance H, Slater Donna M

Affiliations

Department of Physiology and Pharmacology, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada.

Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada.

Publication information

PLoS One. 2025 Jun 27;20(6):e0310937. doi: 10.1371/journal.pone.0310937. eCollection 2025.

Abstract

Spontaneous preterm birth (sPTB) remains a significant global health challenge and a leading cause of neonatal mortality and morbidity. Despite advancements in neonatal care, the prediction of sPTB remains elusive, in part due to complex etiologies and heterogeneous patient populations. This study aimed to validate and extend information on gene expression biomarkers previously described for predicting sPTB using maternal whole blood from the All Our Families pregnancy cohort study based in Calgary, Canada. The results of this study are twofold. First, using additional replicates of maternal blood samples from the All Our Families cohort, we were unable to reproduce the findings of a 2016 study that identified top maternal gene expression predictors for sPTB. Second, we conducted a secondary analysis of the original gene expression dataset from the 2016 study using five modelling approaches (random forest, elastic net regression, unregularized logistic regression, L2-regularized logistic regression, and multilayer perceptron neural network), followed by external validation using a pregnancy cohort based in Detroit, USA. The top performing model (random forest classification) suggested promising performance (area under the receiver operating characteristic curve, AUROC, of 0.99 in the training set), but performance was significantly degraded on the test set (AUROC 0.54) and further degraded in external validation (AUROC 0.50), suggesting poor generalizability, likely due to overfitting exacerbated by a low feature-to-noise ratio. Similar performance was observed in the other four learning models. Prediction was not improved when using higher complexity machine learning approaches (e.g., neural network) over traditional statistical learning (e.g., logistic regression). These findings underscore the challenges in translating biomarker discovery into clinically useful predictive models for sPTB. This study highlights the critical need for rigorous methodological safeguards and external validation in biomarker research. It also emphasizes the impact of data noise and overfitting on model performance, particularly in high-dimensional omics datasets. Future research should prioritize robust validation strategies and explore mechanistic insights to improve our understanding and prediction of sPTB.
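
The gap between training and held-out AUROC reported in the abstract can be illustrated with a small synthetic sketch. It does not use the study's data or any of its five models. As a stand-in, it assumes a 1-nearest-neighbour classifier (chosen because it memorizes its training set) and purely random features with labels that carry no signal. All names, sample sizes, and the feature count are hypothetical.

```python
import random

def auroc(labels, scores):
    """Rank-based AUROC: chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def nearest_label(x, train_X, train_y):
    """Label of the training sample closest to x (squared Euclidean distance)."""
    best = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(x, train_X[i])),
    )
    return train_y[best]

def make_noise_data(rng, n_samples, n_features):
    """Gaussian noise features with balanced labels that are unrelated to the features."""
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y

rng = random.Random(0)  # fixed seed so the sketch is repeatable
train_X, train_y = make_noise_data(rng, n_samples=60, n_features=200)
test_X, test_y = make_noise_data(rng, n_samples=200, n_features=200)

# Each training sample is its own nearest neighbour, so training AUROC is 1.0 by construction.
train_auroc = auroc(train_y, [nearest_label(x, train_X, train_y) for x in train_X])
# Held-out samples have no real signal to exploit, so AUROC should land near chance (0.5).
test_auroc = auroc(test_y, [nearest_label(x, train_X, train_y) for x in test_X])

print(f"train AUROC = {train_auroc:.2f}, test AUROC = {test_auroc:.2f}")
```

A perfect training score here says nothing about predictive value, since the labels are noise. Only the held-out score is informative, which is the same reason the abstract stresses test-set and external validation.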

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3ffe/12204558/40f17c18a9c3/pone.0310937.g001.jpg
