Department of Analytics, Harrisburg University of Science and Technology, Harrisburg, PA 17101, USA.
BioMark Diagnostics Inc., Richmond, BC V6X 2W2, Canada.
Int J Mol Sci. 2024 Nov 4;25(21):11835. doi: 10.3390/ijms252111835.
Breast cancer remains a major public health concern, and early detection is crucial for improving survival rates. Metabolomics offers the potential to develop non-invasive screening and diagnostic tools based on metabolic biomarkers. However, the inherent complexity of metabolomic datasets and the high dimensionality of biomarkers complicate the identification of diagnostically relevant features, and published studies show limited consensus on the specific metabolites involved. Unlike previous studies that rely on a single feature selection technique such as Partial Least Squares (PLS) or LASSO regression, this research combines supervised and unsupervised machine learning methods with random sampling strategies, offering a more robust and interpretable approach to feature selection. This study aimed to identify a parsimonious and robust set of biomarkers for breast cancer diagnosis using metabolomics data. Plasma samples from 185 breast cancer patients and 53 controls (from the Cooperative Human Tissue Network, USA) were analyzed. To address the common issue of dataset imbalance, we used propensity score matching (PSM), which supports reliable comparisons between the cancer and control groups. We employed Univariate Naïve Bayes, L2-regularized Support Vector Classifier (SVC), Principal Component Analysis (PCA), and feature engineering techniques to refine and select the most informative features. Our best-performing feature set comprised 11 biomarkers: 9 metabolites (SM(OH) C22:2, SM C18:0, C0, C3OH, C14:2OH, C16:2OH, LysoPC a C18:1, PC aa C36:0, and asparagine), 1 metabolite ratio (Kynurenine-to-Tryptophan), and 1 demographic variable (age). This set achieved an area under the ROC curve (AUC) of 98%. These results demonstrate the potential for a robust, cost-effective, and non-invasive breast cancer screening and diagnostic tool, offering significant clinical value for early detection and personalized patient management.
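To make two of the quantities above concrete, the following is a minimal sketch of how a metabolite ratio such as Kynurenine-to-Tryptophan can be derived as an engineered feature and how an AUC can be read as the probability that a randomly chosen case scores higher than a randomly chosen control. The function names `kyn_trp_ratio` and `roc_auc`, the concentrations, and the labels are all hypothetical and chosen for illustration; they are not the study's data or code, and the direction of the ratio in the toy data implies nothing about the actual findings.

```python
from typing import Sequence


def kyn_trp_ratio(kynurenine: float, tryptophan: float) -> float:
    """Return the Kynurenine-to-Tryptophan ratio (both in the same unit)."""
    if tryptophan <= 0:
        raise ValueError("tryptophan concentration must be positive")
    return kynurenine / tryptophan


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (case, control) pairs ranked correctly.

    Ties count as half a correct pair (the Mann-Whitney formulation).
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy, made-up (kynurenine, tryptophan) concentrations; NOT study data.
samples = [(2.4, 50.0), (2.0, 55.0), (1.5, 40.0), (1.9, 52.0), (1.4, 62.0)]
labels = [1, 1, 0, 1, 0]  # 1 = case, 0 = control (hypothetical)

ratios = [kyn_trp_ratio(k, t) for k, t in samples]
auc = roc_auc(ratios, labels)
print(f"AUC of the toy ratio feature: {auc:.2f}")
```

In this reading, an AUC of 98% means that about 98 of every 100 case-control pairs are ranked in the correct order by the classifier's score. The pairwise loop is quadratic in sample size, which is acceptable for a cohort of a few hundred samples but would usually be replaced by a rank-based computation for larger data.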