Liu Hua-Ping, Wang Dongwen, Lai Hung-Ming
National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital & Shenzhen Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Shenzhen 518116, China.
Aiphaqua Genomics Research Unit, Taipei 111, Taiwan.
Comput Struct Biotechnol J. 2022 May 23;20:2672-2679. doi: 10.1016/j.csbj.2022.05.035. eCollection 2022.
There is a growing need for models that use single-cell RNA-seq (scRNA-seq) to separate malignant from nonmalignant cells and to identify the tumor of origin of single cells and/or circulating tumor cells (CTCs). Currently, it is infeasible to build a tumor-of-origin model learnt from scRNA-seq by machine learning (ML). We therefore asked whether an ML model learnt from bulk transcriptomes can be applied to scRNA-seq to infer tumor presence in single cells and, further, to indicate their tumor of origin. We used k-nearest neighbors, one-versus-all support vector machine, one-versus-one support vector machine and random forest, and introduced scTumorTrace, to conduct a pioneering experiment covering leukocytes and seven major cancer types for which both bulk RNA-seq and scRNA-seq data were available. All 13 ML models learnt from bulk RNA-seq were reliable on a validation set of bulk transcriptomes (F-score > 96%), but none of them except scTumorTrace was applicable to scRNA-seq. Inference from bulk RNA-seq to scRNA-seq was impaired by feature selection and improved by log2-transformed transcripts per million (TPM) units. scTumorTrace with transcriptome-wide 2-tuples achieved F-scores above 98.74% and 94.29% in inferring tumor presence and tumor of origin, respectively, at single-cell resolution, and correctly identified as leukocytes 45 single cells that were candidate prostate CTCs but lineage-confirmed non-CTCs. We concluded that modern ML techniques are quantitative and could hardly address the questions raised. scTumorTrace with transcriptome-wide 2-tuples is qualitative, standardization-free and not subject to log2-transformed quantities, enabling us to infer the tumor presence of single-cell transcriptomes and their tumor of origin from bulk transcriptomes.
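The claim that a qualitative, 2-tuple representation is standardization-free and unaffected by log2 transformation can be illustrated with a small sketch. The abstract does not define how scTumorTrace builds its 2-tuples, so the code below assumes, purely for illustration, that a 2-tuple is a gene pair summarized by which of the two genes is more highly expressed. The gene names, TPM values, pseudocount and helper functions are all hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def pair_signs(expr):
    """Map each gene pair (a, b) to True when gene a is expressed above gene b.

    Assumption: this relative-order encoding stands in for a "2-tuple";
    it is not the published scTumorTrace definition.
    """
    genes = sorted(expr)
    return {(a, b): expr[a] > expr[b] for a, b in combinations(genes, 2)}


def log2_tpm(expr, pseudocount=1.0):
    """Apply log2(TPM + pseudocount); the pseudocount of 1 is an assumption."""
    return {g: math.log2(v + pseudocount) for g, v in expr.items()}


# Hypothetical TPM values for four made-up genes in one sample.
tpm = {"GENE_A": 120.0, "GENE_B": 3.5, "GENE_C": 0.0, "GENE_D": 48.2}

raw_pairs = pair_signs(tpm)
log_pairs = pair_signs(log2_tpm(tpm))
# A uniform rescaling mimics a change of unit or sequencing depth.
scaled_pairs = pair_signs({g: 10 * v for g, v in tpm.items()})

# The pairwise orderings are identical under all three representations.
print(raw_pairs == log_pairs == scaled_pairs)
```

Because log2 and positive rescaling are strictly increasing, the within-sample ordering of any two genes is preserved, whereas a quantitative model trained on magnitudes would see different inputs in each case. This may be one reason a pairwise, qualitative representation transfers from bulk to single-cell data more readily, though the sketch does not reproduce the authors' method.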