Wen FuDong, Su Yue, Liu Dan, Wang YuPeng, Liu MeiNa
Department of Biostatistics, Public Health College, Harbin Medical University, Harbin City, 150081, Heilongjiang Province, China.
BMC Bioinformatics. 2025 Jul 1;26(1):165. doi: 10.1186/s12859-025-06193-2.
High-dimensional proteomics data present significant challenges in biomarker discovery due to technical noise, feature redundancy, and multicollinearity. Current feature selection methods, including filter, wrapper, and embedded approaches, struggle with stability, sparsity, and computational efficiency. To address these limitations, we propose Soft-Thresholded Compressed Sensing (ST-CS), a hybrid framework integrating 1-bit compressed sensing with K-Medoids clustering. Unlike conventional methods relying on manual thresholds, ST-CS automates feature selection by dynamically partitioning coefficient magnitudes into discriminative biomarkers and noise.
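The automated thresholding idea can be illustrated with a minimal sketch. It assumes that a coefficient vector has already been estimated (the 1-bit compressed sensing step is not shown) and applies a one-dimensional, two-cluster K-Medoids split to the absolute coefficients. The function name `two_medoid_split`, the min/max initialisation, and the example coefficients are hypothetical and are not taken from the authors' implementation.

```python
def two_medoid_split(coefs, max_iter=100):
    """Split |coefs| into a high-magnitude and a low-magnitude cluster
    with 1-D K-Medoids (k = 2); return indices of the high cluster."""
    mags = [abs(c) for c in coefs]
    # Assumption: start the medoids at the smallest and largest magnitudes.
    medoids = [min(mags), max(mags)]
    if medoids[0] == medoids[1]:
        return []  # all magnitudes identical, nothing to separate
    for _ in range(max_iter):
        # Assignment step: each magnitude joins its nearest medoid.
        labels = [0 if abs(m - medoids[0]) <= abs(m - medoids[1]) else 1
                  for m in mags]
        # Update step: the new medoid is the cluster member with the
        # smallest total absolute distance to the other members.
        new_medoids = []
        for k in (0, 1):
            members = [m for m, lab in zip(mags, labels) if lab == k]
            new_medoids.append(
                min(members, key=lambda c: sum(abs(c - m) for m in members)))
        if new_medoids == medoids:
            break  # converged
        medoids = new_medoids
    high = 0 if medoids[0] > medoids[1] else 1
    return [i for i, lab in enumerate(labels) if lab == high]


# Hypothetical coefficients: two clear signals among near-zero noise.
example_coefs = [0.01, -0.02, 0.9, 0.0, -0.85, 0.03, 0.015]
selected = two_medoid_split(example_coefs)  # indices 2 and 4
```

In this sketch no cut-off value is supplied by the user; the boundary between "biomarker" and "noise" follows from where the two medoids settle.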
Evaluations on simulated and real-world proteomic datasets demonstrated ST-CS's superiority in feature selection capability and classification performance. In simulations, ST-CS achieved feature selection robustness with balanced sensitivity (> 80%) and specificity (> 99.8%), reducing false discovery rates (FDR) by 20-50% compared to Hard-Thresholded Compressed Sensing (HT-CS). Additionally, it attained superior F1 scores and Matthews Correlation Coefficients (MCC), outperforming HT-CS, LASSO, and SPLSDA in identifying true biomarkers while suppressing noise. For classification performance, ST-CS surpassed all methods in the area under the receiver operating characteristic curve (AUC) across varying noise levels while maintaining sparsity. Applied to Clinical Proteomic Tumor Analysis Consortium (CPTAC) datasets, ST-CS matched HT-CS's classification accuracy (AUC = 97.47% for intrahepatic cholangiocarcinoma) but with 57% fewer selected features (37 vs. 86), demonstrating its dual strength in precision biomarker discovery and predictive accuracy. For glioblastoma data, ST-CS achieved higher AUC (72.71%) than HT-CS (72.15%), LASSO (67.80%), and SPLSDA (71.38%) while retaining a parsimonious feature set (30 vs. 58 features for HT-CS). In ovarian serous cystadenocarcinoma, ST-CS further demonstrated its adaptability, attaining superior AUC (75.86%) over HT-CS (75.61%), LASSO (61.00%), and SPLSDA (70.75%) with only 24 ± 5 selected biomarkers. These results highlight ST-CS's ability to rigorously automate feature selection while balancing classification efficacy, interpretability, and scalability for translational proteomics.
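For readers who want to reproduce the reported feature-selection metrics on their own simulations, the sketch below computes sensitivity, specificity, FDR, F1, and MCC from a selected feature set and a known set of true biomarkers, using the standard confusion-matrix definitions. The function name `selection_metrics` and the toy inputs are hypothetical; only the final line uses figures from the abstract (37 vs. 86 features).

```python
import math


def selection_metrics(selected, true_features, n_features):
    """Confusion-matrix metrics for a feature-selection result."""
    selected, true_features = set(selected), set(true_features)
    tp = len(selected & true_features)   # true biomarkers that were selected
    fp = len(selected - true_features)   # noise features that were selected
    fn = len(true_features - selected)   # true biomarkers that were missed
    tn = n_features - tp - fp - fn       # noise features correctly rejected

    def ratio(num, den):
        # Return 0.0 instead of dividing by zero for degenerate cases.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "fdr": ratio(fp, tp + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Toy example: 1000 features, 5 true biomarkers, 5 selected (4 correct).
metrics = selection_metrics({0, 1, 2, 3, 10}, {0, 1, 2, 3, 4}, 1000)

# Relative reduction in selected features reported for the
# intrahepatic cholangiocarcinoma data (37 vs. 86 features).
reduction_pct = 100 * (86 - 37) / 86  # about 57%
```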