Machine learning driven biomarker selection for medical diagnosis.

Authors

Bavikadi Divyagna, Agarwal Ayushi, Ganta Shashank, Chung Yunro, Song Lusheng, Qiu Ji, Shakarian Paulo

Affiliations

Fulton Schools of Engineering, Arizona State University, Tempe, Arizona, United States of America.

Biodesign Center for Personalized Diagnostics, Arizona State University, Tempe, Arizona, United States of America.

Publication information

PLoS One. 2025 Jun 11;20(6):e0322620. doi: 10.1371/journal.pone.0322620. eCollection 2025.

DOI:10.1371/journal.pone.0322620

PMID:40498685

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12157214/

Abstract

Recent advances in experimental methods have enabled researchers to collect data on thousands of analytes simultaneously. This has led to correlational studies that associate molecular measurements with diseases such as Alzheimer's disease and liver and gastric cancer. However, using thousands of biomarkers selected from these analytes is impractical for real-world medical diagnosis and is likely undesirable because of the risk of spurious correlations. In this study, we evaluate 4 methods for biomarker selection and 5 machine learning (ML) classifiers for identifying correlations, for 20 approaches in all. We found that contemporary methods outperform previously reported logistic regression when 3 or 10 biomarkers are permitted. With specificity fixed at 0.9, ML approaches achieved a sensitivity of 0.240 (3 biomarkers) and 0.520 (10 biomarkers), whereas standard logistic regression achieved a sensitivity of 0.000 (3 biomarkers) and 0.040 (10 biomarkers). Causal-based methods for biomarker selection performed best when fewer biomarkers were permitted, while univariate feature selection performed best when more biomarkers were permitted.
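
The headline numbers above are sensitivities reported at a fixed specificity of 0.9. As a rough illustration of how such a metric can be computed from classifier scores, the sketch below picks the lowest decision threshold that still classifies at least 90% of the negative samples correctly, then measures the fraction of positives scoring above it. The function name, the toy labels and scores, and the tie-handling rule are assumptions made for illustration; the abstract does not describe the authors' exact procedure.

```python
import math


def sensitivity_at_specificity(labels, scores, specificity=0.9):
    """Return (sensitivity, threshold) at a target specificity.

    labels: iterable of 0 (control) / 1 (case).
    scores: iterable of classifier scores, higher means more likely a case.
    A sample is predicted positive when its score is strictly above the
    threshold (an assumed convention, not taken from the paper).
    """
    neg = sorted(s for y, s in zip(labels, scores) if y == 0)
    pos = [s for y, s in zip(labels, scores) if y == 1]
    if not neg or not pos:
        raise ValueError("need at least one positive and one negative sample")

    # Number of negatives that must fall at or below the threshold.
    # The small epsilon guards against floating-point error in the product.
    k = math.ceil(specificity * len(neg) - 1e-9)

    # k-th smallest negative score; with ties the achieved specificity
    # can only be higher than the target, never lower.
    threshold = neg[k - 1] if k > 0 else float("-inf")

    sensitivity = sum(s > threshold for s in pos) / len(pos)
    return sensitivity, threshold


# Toy data (hypothetical): 10 controls followed by 4 cases.
labels = [0] * 10 + [1] * 4
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.80,
          0.30, 0.50, 0.60, 0.90]

sens, thr = sensitivity_at_specificity(labels, scores, specificity=0.9)
print(f"sensitivity={sens:.3f} at threshold={thr:.2f}")
# prints "sensitivity=0.750 at threshold=0.45"
```

In this toy example, nine of the ten controls score at or below 0.45, so specificity is 0.9, and three of the four cases score above that threshold. In practice the threshold would normally be chosen on training or validation data and then applied to held-out samples.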