Dehkharghanian Taher, Rahnamayan Shahryar, Tizhoosh Hamid R
Annu Int Conf IEEE Eng Med Biol Soc. 2020 Jul;2020:5308-5311. doi: 10.1109/EMBC44109.2020.9176699.
In this paper, we introduce a new dataset for cancer research containing the somatic mutation states of 536 genes from the Cancer Gene Census (CGC). We used somatic mutation information from The Cancer Genome Atlas (TCGA) projects to create this dataset. As a preliminary investigation, we employed machine learning techniques, including k-Nearest Neighbors, Decision Tree, Random Forest, and Artificial Neural Networks (ANNs), to evaluate the potential of these somatic mutations for classifying cancer types. We compared our models on accuracy, precision, recall, and F1-score. We observed that ANNs outperformed the other models, with an F1-score of 0.36, an overall classification accuracy of 40%, and precision ranging from 12% to 92% across cancer types. The 40% accuracy is significantly higher than random guessing, which would have resulted in about 3% overall classification accuracy across the 33 cancer types. Although the model has relatively low overall accuracy, it has an average classification specificity of 98%. The ANN achieved high precision scores (> 0.7) for 5 of the 33 cancer types. The introduced dataset can be used for research on TCGA data, such as survival analysis, histopathology image analysis, and content-based image retrieval. The dataset is available online for download: https://kimialab.uwaterloo.ca/kimia/.
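The contrast the abstract draws between a 40% overall accuracy and a 98% average specificity can be illustrated with a small worked sketch. The code below is not the authors' evaluation code and does not use their data. It builds a synthetic, perfectly balanced 33-class confusion matrix (an assumption; the real class sizes are not given here) in which each class has 64 correct predictions and 3 errors toward each of the other 32 classes, then computes one-vs-rest precision, recall, specificity, and F1 per class. The names `toy_confusion_matrix`, `per_class_metrics`, and `macro_average` are hypothetical helpers introduced for illustration only.

```python
from typing import Dict, List

NUM_CLASSES = 33  # number of cancer types stated in the abstract


def toy_confusion_matrix(n_classes: int = NUM_CLASSES,
                         correct: int = 64,
                         wrong_each: int = 3) -> List[List[int]]:
    """Synthetic balanced matrix (assumption, not the paper's data).

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    return [[correct if i == j else wrong_each for j in range(n_classes)]
            for i in range(n_classes)]


def accuracy(cm: List[List[int]]) -> float:
    """Overall accuracy: correctly classified samples over all samples."""
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total


def per_class_metrics(cm: List[List[int]]) -> List[Dict[str, float]]:
    """One-vs-rest precision, recall, specificity, and F1 for each class."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # class k predicted as another class
        fp = sum(cm[i][k] for i in range(n)) - tp   # other classes predicted as k
        tn = total - tp - fn - fp                   # everything not involving class k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append({"precision": precision, "recall": recall,
                        "specificity": specificity, "f1": f1})
    return results


def macro_average(metrics: List[Dict[str, float]], key: str) -> float:
    """Unweighted mean of one metric over all classes."""
    return sum(m[key] for m in metrics) / len(metrics)


cm = toy_confusion_matrix()
metrics = per_class_metrics(cm)
acc = accuracy(cm)
macro_spec = macro_average(metrics, "specificity")
macro_f1 = macro_average(metrics, "f1")
chance = 1 / NUM_CLASSES  # expected accuracy of uniform random guessing

print(f"accuracy:          {acc:.3f}")
print(f"macro specificity: {macro_spec:.3f}")
print(f"macro F1:          {macro_f1:.3f}")
print(f"chance accuracy:   {chance:.3f}")
```

In this toy setting the accuracy is 0.40 while the macro-averaged specificity is about 0.98. With 33 classes, the true negatives for any single class vastly outnumber its false positives, so specificity stays high even when most predictions are wrong. The toy macro F1 comes out at 0.40 rather than the reported 0.36, which is expected: the real data are presumably imbalanced and the per-class precision varies widely (12% to 92%).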