Ahn Taejin, Kim Kidong, Kim Hyojin, Kim Sarah, Park Sangick, Lee Kyoungbun
Department of Life Science, Handong Global University, Pohang, Republic of Korea.
Department of Obstetrics and Gynecology, Seoul National University Bundang Hospital, Seongnam, Republic of Korea.
Cancer Inform. 2022 Nov 15;21:11769351221135141. doi: 10.1177/11769351221135141. eCollection 2022.
There is a lack of tools for identifying the site of origin in mucinous cancer. This study aimed to evaluate the performance of a transcriptome-based classifier for identifying the site of origin in mucinous cancer.
Transcriptomic data from 1878 non-mucinous and 82 mucinous cancer specimens, obtained from The Cancer Genome Atlas, were used as the training and validation sets, respectively. The specimens covered 7 sites of origin: the uterine cervix (CESC), colon (COAD), pancreas (PAAD), stomach (STAD), uterine endometrium (UCEC), uterine carcinosarcoma (UCS), and ovary (OV). Transcriptomic data from 14 mucinous cancer specimens from a tissue archive were used as the test set. To identify the site of origin, a set of 100 differentially expressed genes was selected for each site. After duplicate genes shared between sites were removed, 427 genes remained, and their RNA expression profiles were used to train a deep neural network classifier. The performance of the classifier was estimated on the training, validation, and test sets.
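The gene-selection step can be illustrated with a minimal sketch. It assumes each site already has a ranked list of differentially expressed genes; the function name `select_gene_panel`, the toy gene lists, and the order-preserving de-duplication are illustrative assumptions, not the procedure or data reported in the study. With 7 sites and 100 genes per site, 700 candidates reducing to 427 implies that 273 entries were duplicates.

```python
from typing import Dict, List


def select_gene_panel(ranked_genes_by_site: Dict[str, List[str]],
                      top_k: int = 100) -> List[str]:
    """Take the top_k genes per site and merge them into one panel.

    Genes selected for more than one site are kept only once,
    in order of first appearance.
    """
    panel: List[str] = []
    seen = set()
    for ranked in ranked_genes_by_site.values():
        for gene in ranked[:top_k]:
            if gene not in seen:
                seen.add(gene)
                panel.append(gene)
    return panel


# Toy input (hypothetical rankings, NOT results from the study).
toy_rankings = {
    "COAD": ["CDX2", "CDX1", "VIL1"],
    "STAD": ["CDX2", "TFF1", "VIL1"],
    "OV": ["PAX8", "WT1", "TFF1"],
}

panel = select_gene_panel(toy_rankings, top_k=2)
print(panel)

# Arithmetic implied by the abstract: 7 sites x 100 genes, 427 unique.
duplicates_removed = 7 * 100 - 427
print(duplicates_removed)
```

The resulting panel defines the input features of the classifier; in this sketch the expression values themselves are left out, since the abstract does not describe how they were normalized.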
The accuracy of the model was 0.998 in the training set and 0.939 (77/82) in the validation set. In the test set, which was newly sequenced from a tissue archive, the model achieved an accuracy of 0.857 (12/14). t-SNE analysis showed that the test-set samples fell within the clusters formed by the training set.
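The reported accuracies follow directly from the counts given in parentheses, as the short sketch below shows. The Wilson score interval is not part of the abstract; it is added here only as an assumption, to illustrate how much uncertainty a 14-sample test set carries. The function names are hypothetical.

```python
import math
from typing import Tuple


def accuracy(correct: int, total: int) -> float:
    """Fraction of correctly classified samples."""
    return correct / total


def wilson_interval(correct: int, total: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    p = correct / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total
                         + z ** 2 / (4 * total ** 2)) / denom
    return center - half, center + half


# Counts taken from the abstract.
val_acc = accuracy(77, 82)    # validation set
test_acc = accuracy(12, 14)   # test set

print(round(val_acc, 3), round(test_acc, 3))

# Interval for the test set (an added illustration, not a reported result).
test_low, test_high = wilson_interval(12, 14)
print(test_low, test_high)
```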
Although the study was limited by a small sample size, it showed that a transcriptome-based classifier could correctly identify the site of origin of mucinous cancer.