Liu Yiheng, Liu Junfeng, Wan Jiayi, Hao Hongke, Liu Guangxing, Huang Xia
Department of Biosciences and Bioinformatics, Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, PR China.
Innovation Academy for Precision Measurement Science and Technology, Chinese Academy of Sciences, Wuhan 430071, PR China.
Comput Struct Biotechnol J. 2025 May 30;27:2288-2297. doi: 10.1016/j.csbj.2025.05.051. eCollection 2025.
Raman spectroscopy extracts rich biochemical information at the single-cell level, demonstrating significant potential for precise cancer identification. While machine learning improves the efficiency of spectral analysis, conventional models remain constrained by the amount of available data. Here, we developed Random Splicing-Convolutional Neural Network (RS-CNN), a deep learning framework that addresses data scarcity through spectral concatenation. By randomly splicing Raman spectra from the same cell line, RS-CNN enhances distinctive spectral features while simultaneously expanding dataset size and improving signal quality. Validation across six breast cancer cell lines demonstrated RS-CNN's superiority over five benchmark models (SVM, LDA, PCA-SVM, PCA-LDA, CNN). With 450 spectra per cell line, RS-CNN achieved 98.63% classification accuracy, compared with around 85% for the conventional models. Under data-limited conditions (100 spectra per cell line), RS-CNN maintained 91.47% accuracy, outperforming CNN's 70.83%. RS-CNN's generalizability was further validated on an independently acquired dataset, where it achieved at least 94% classification accuracy. SHAP analysis suggested that the spectral region around 980 cm⁻¹ was significant for cancer diagnosis, while the 1158-1160 cm⁻¹ and 1603-1607 cm⁻¹ regions were particularly valuable for distinguishing between cancer subtypes. These findings establish RS-CNN as a robust analytical model for clinical Raman diagnostics, particularly valuable in applications requiring high accuracy with limited samples.
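The random-splicing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `random_splice`, the number of spectra joined per sample (`n_splice`), the number of spliced samples drawn per class (`n_per_class`), and sampling without replacement are all assumptions made for illustration, since the abstract states only that spectra from the same cell line are randomly concatenated.

```python
import random
from collections import defaultdict


def random_splice(spectra, labels, n_splice=3, n_per_class=5, seed=0):
    """Build spliced samples by concatenating randomly chosen spectra
    that share the same label (hypothetical helper, not from the paper).

    spectra: list of equal-length intensity lists
    labels:  list of class labels (e.g. cell line names), one per spectrum
    """
    rng = random.Random(seed)

    # Group spectra by cell line so splicing never mixes classes.
    by_class = defaultdict(list)
    for spectrum, label in zip(spectra, labels):
        by_class[label].append(list(spectrum))

    spliced, spliced_labels = [], []
    for label, pool in by_class.items():
        if len(pool) < n_splice:
            raise ValueError(f"class {label!r} has fewer than {n_splice} spectra")
        for _ in range(n_per_class):
            # Assumption: draw distinct spectra and join them end to end.
            parts = rng.sample(pool, n_splice)
            spliced.append([value for part in parts for value in part])
            spliced_labels.append(label)
    return spliced, spliced_labels


# Toy data: two "cell lines", five 4-point spectra each.
# Class A intensities lie in 0..4, class B intensities in 10..14.
spectra = [[float(c * 10 + i)] * 4 for c in range(2) for i in range(5)]
labels = ["A"] * 5 + ["B"] * 5

X, y = random_splice(spectra, labels, n_splice=3, n_per_class=4)
```

Each spliced sample is `n_splice` times longer than a single spectrum, and the number of distinct combinations grows quickly with the pool size, which is one way a small set of measured spectra can yield a much larger training set for a classifier.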