Robles Edgar E, Jin Ye, Smyth Padhraic, Scheuermann Richard H, Bui Jack D, Wang Huan-You, Oak Jean, Qian Yu
medRxiv. 2023 Feb 10:2023.02.07.23285606. doi: 10.1101/2023.02.07.23285606.
Precise identification of cancer cells in patient samples is essential for accurate diagnosis and clinical monitoring but has been a significant challenge in machine learning approaches for cancer precision medicine. In most scenarios, training data are only available with disease annotation at the subject or sample level. Traditional approaches separate the classification process into multiple steps that are optimized independently. Recent methods either focus on predicting sample-level diagnosis without identifying individual pathologic cells or are less effective for identifying heterogeneous cancer cell phenotypes.
We developed a generalized end-to-end differentiable model, the Cell Scoring Neural Network (CSNN), which takes the available sample-level training data and simultaneously predicts both the diagnosis of the test samples and the identity of the diagnostic cells in each sample. The cell-level density differences between samples are linked to the sample diagnosis, which allows the probability that an individual cell is diagnostic to be calculated using backpropagation. We applied CSNN to two independent clinical flow cytometry datasets for leukemia diagnosis. In both qualitative and quantitative assessments, CSNN outperformed preexisting neural network modeling approaches for both cancer diagnosis and cell-level classification. Post hoc decision trees and 2D dot plots were generated to interpret the identified cancer cells, showing that the identified cell phenotypes match the cancer endotypes observed clinically in patient cohorts. Independent data clustering analysis confirmed the identified cancer cell populations.
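The general idea of learning cell-level scores from sample-level labels alone can be illustrated with a toy sketch. This is not the CSNN architecture or the authors' code: it assumes a single logistic cell scorer, synthetic two-marker data, and a sample-level prediction computed from the mean cell score with a fixed `scale` and `threshold`. All function names and parameters below are hypothetical and chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_sample(rng, diseased, n_cells=200, n_markers=2):
    """Synthetic sample: background cells plus, if diseased, a shifted subpopulation."""
    cells = rng.normal(0.0, 1.0, size=(n_cells, n_markers))
    abnormal = np.zeros(n_cells, dtype=bool)
    if diseased:
        abnormal[: n_cells // 5] = True   # assumption: 20% abnormal cells
        cells[abnormal] += 4.0            # assumption: shifted marker expression
    return cells, abnormal

def predict(cells, w, b, scale=20.0, threshold=0.1):
    """Score every cell, then map the mean cell score to a sample-level probability."""
    cell_scores = sigmoid(cells @ w + b)
    sample_prob = sigmoid(scale * (cell_scores.mean() - threshold))
    return cell_scores, sample_prob

def train(samples, labels, lr=0.2, epochs=300, scale=20.0, threshold=0.1):
    """Gradient descent on sample-level cross-entropy; cell labels are never used."""
    w = np.zeros(samples[0].shape[1])
    b = 0.0
    for _ in range(epochs):
        grad_w = np.zeros_like(w)
        grad_b = 0.0
        for cells, y in zip(samples, labels):
            cell_scores, sample_prob = predict(cells, w, b, scale, threshold)
            # Backpropagate from the sample-level loss to each cell's logit.
            d_logit = sample_prob - y
            d_z = d_logit * scale * cell_scores * (1.0 - cell_scores) / len(cells)
            grad_w += d_z @ cells
            grad_b += d_z.sum()
        w -= lr * grad_w / len(samples)
        b -= lr * grad_b / len(samples)
    return w, b

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = [0, 1] * 20
    samples = [make_sample(rng, bool(y))[0] for y in labels]
    w, b = train(samples, labels)

    test_cells, test_abnormal = make_sample(rng, diseased=True)
    scores, prob = predict(test_cells, w, b)
    print("sample-level probability:", prob)
    print("mean score, abnormal cells:", scores[test_abnormal].mean())
    print("mean score, normal cells:", scores[~test_abnormal].mean())
```

In this sketch the only supervision is the per-sample label, yet the gradient reaches every cell's score, so cells from the subpopulation enriched in diseased samples end up scored higher than background cells.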
The source code of CSNN and datasets used in the experiments are publicly available on GitHub and FlowRepository.
Edgar E. Robles: roblesee@uci.edu and Yu Qian: mqian@jcvi.org.
Supplementary data are available on GitHub and online.