Department of Computer Science, University of California, Irvine, CA 92697, United States.
Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, United States.
Bioinformatics. 2023 Oct 3;39(10). doi: 10.1093/bioinformatics/btad585.
Precise identification of cancer cells in patient samples is essential for accurate diagnosis and clinical monitoring but has been a significant challenge in machine learning approaches for cancer precision medicine. In most scenarios, training data are only available with disease annotation at the subject or sample level. Traditional approaches separate the classification process into multiple steps that are optimized independently. Recent methods either focus on predicting sample-level diagnosis without identifying individual pathologic cells or are less effective for identifying heterogeneous cancer cell phenotypes.
We developed a generalized end-to-end differentiable model, the Cell Scoring Neural Network (CSNN), which takes sample-level training data and simultaneously predicts the diagnosis of the testing samples and the identity of the diagnostic cells in each sample. The cell-level density differences between samples are linked to the sample diagnosis, which allows the probability of each individual cell being diagnostic to be calculated using backpropagation. We applied CSNN to two independent clinical flow cytometry datasets for leukemia diagnosis. In both qualitative and quantitative assessments, CSNN outperformed preexisting neural network modeling approaches for both cancer diagnosis and cell-level classification. Post hoc decision trees and 2D dot plots were generated to interpret the identified cancer cells, showing that the identified cell phenotypes match the cancer endotypes observed clinically in patient cohorts. Independent data clustering analysis confirmed the identified cancer cell populations.
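The idea of learning cell-level scores from sample-level labels alone can be illustrated with a minimal sketch. This is not the CSNN implementation from the repository. It assumes a single logistic cell scorer, a sample-level probability computed as a logistic function of the mean cell score (a crude stand-in for the density link described above), hand-written gradients in place of an autodiff library, and synthetic two-marker data in which about 30% of the cells in each diseased sample are shifted. All function names, parameters, and data here are hypothetical.

```python
import math
import random


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def cell_scores(cells, w, b):
    """Per-cell probability of being diagnostic (logistic scorer on markers)."""
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) for x in cells]


def sample_prob(cells, params):
    """Sample-level probability: logistic function of the mean cell score."""
    w, b, a, c = params
    scores = cell_scores(cells, w, b)
    return sigmoid(a * sum(scores) / len(scores) + c)


def train(samples, labels, steps=300, lr=0.5):
    """Full-batch gradient descent using only sample-level labels."""
    n_markers = len(samples[0][0])
    w, b, a, c = [0.0] * n_markers, 0.0, 1.0, 0.0
    history = []
    for _ in range(steps):
        gw, gb, ga, gc, loss = [0.0] * n_markers, 0.0, 0.0, 0.0, 0.0
        for cells, y in zip(samples, labels):
            scores = cell_scores(cells, w, b)
            m = sum(scores) / len(scores)      # mean cell score of the sample
            p = sigmoid(a * m + c)             # sample-level prediction
            p_safe = min(max(p, 1e-7), 1 - 1e-7)
            loss -= y * math.log(p_safe) + (1 - y) * math.log(1 - p_safe)
            dz = p - y                         # d(cross-entropy)/d(logit)
            ga += dz * m
            gc += dz
            for x, s in zip(cells, scores):
                # chain rule: sample logit -> mean -> per-cell sigmoid
                ds = dz * a * s * (1 - s) / len(cells)
                gb += ds
                for j, xj in enumerate(x):
                    gw[j] += ds * xj
        n = len(samples)
        history.append(loss / n)
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
        a -= lr * ga / n
        c -= lr * gc / n
    return (w, b, a, c), history


def make_sample(rng, diseased, n_cells=100, n_markers=2):
    """Synthetic sample: diseased samples contain ~30% shifted (abnormal) cells."""
    cells = []
    for _ in range(n_cells):
        abnormal = diseased and rng.random() < 0.3
        center = 4.0 if abnormal else 0.0
        cells.append([rng.gauss(center, 1.0) for _ in range(n_markers)])
    return cells


rng = random.Random(0)
labels = [0, 1] * 6
samples = [make_sample(rng, bool(y)) for y in labels]
params, history = train(samples, labels)
w, b, a, c = params
print("loss:", round(history[0], 3), "->", round(history[-1], 3))
print("abnormal-like cell score:", cell_scores([[4.0, 4.0]], w, b)[0])
print("normal-like cell score:", cell_scores([[0.0, 0.0]], w, b)[0])
```

Although no cell is ever labeled, the gradient of the sample-level loss flows back to each cell's score, so cells that are enriched in diseased samples end up with higher scores. A real model would replace the linear scorer with a neural network and use an automatic differentiation framework.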
The source code of CSNN and datasets used in the experiments are publicly available on GitHub (http://github.com/erobl/csnn). Raw FCS files can be downloaded from FlowRepository (ID: FR-FCM-Z6YK).