Zeng Zexian, Mao Chengsheng, Vo Andy, Li Xiaoyu, Nugent Janna Ore, Khan Seema A, Clare Susan E, Luo Yuan
Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, 750 N Lake Shore Drive Room 11-189, Chicago, IL, 60611, USA.
Department of Data Sciences, Dana-Farber Cancer Institute, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
BMC Bioinformatics. 2021 Oct 25;22(Suppl 4):491. doi: 10.1186/s12859-021-04400-4.
Genetic information is becoming more readily available and is increasingly used to predict patients' cancer types and subtypes. Most classification methods to date treat somatic mutations as independent features and are limited by statistical power. We aim to develop a novel method that effectively explores the landscape of genetic variants, including germline variants and small insertions and deletions, for cancer type prediction.
We propose DeepCues, a deep learning model that uses convolutional neural networks to derive features from raw cancer DNA sequencing data in an unbiased manner for disease classification and relevant gene discovery. With raw whole-exome sequencing data as input, germline variants and somatic mutations, including insertions and deletions, were combined interactively for feature generation and cancer prediction. We applied DeepCues to a TCGA dataset to classify seven major cancer types and obtained an overall accuracy of 77.6%. Compared with conventional methods, DeepCues showed a significant overall improvement (p < 0.001). Strikingly, the top 20 breast cancer-relevant genes identified by DeepCues had a 40% overlap with the top 20 known breast cancer driver genes.
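As a rough illustration of how a convolutional layer can derive features from raw sequence, the sketch below one-hot encodes a DNA string, slides a filter along it, and max-pools the result. It is not the DeepCues implementation: the abstract does not specify the architecture, so the function names (`one_hot`, `conv1d_relu`, `global_max_pool`), the filter width, and the hand-built filter are all assumptions made for this example. In a trained model the filter weights would be learned rather than set by hand.

```python
import numpy as np

BASES = "ACGT"  # assumed channel order for this sketch


def one_hot(seq):
    """Encode a DNA string as a (length, 4) array; unknown bases such as N stay all zeros."""
    index = {base: i for i, base in enumerate(BASES)}
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in index:
            encoded[pos, index[base]] = 1.0
    return encoded


def conv1d_relu(x, filters):
    """Valid 1D cross-correlation followed by ReLU.

    x: (length, 4); filters: (n_filters, width, 4).
    Returns an array of shape (length - width + 1, n_filters).
    """
    width = filters.shape[1]
    windows = np.stack([x[i:i + width] for i in range(x.shape[0] - width + 1)])
    scores = np.einsum("pwc,fwc->pf", windows, filters)
    return np.maximum(scores, 0.0)


def global_max_pool(feature_map):
    """Keep the strongest response of each filter across all positions."""
    return feature_map.max(axis=0)


# Hypothetical filter that responds most strongly to the 3-mer "GAT".
motif_filter = one_hot("GAT")[None, :, :]  # shape (1, 3, 4)
feature_map = conv1d_relu(one_hot("ACGATTAC"), motif_filter)
features = global_max_pool(feature_map)
```

The pooled vector `features` has one entry per filter and could be passed to a classifier. Because the filter scans every position, a variant anywhere in the window, whether germline or somatic, changes the response, which is one way to read the abstract's claim that features come from the raw sequence rather than from a precomputed mutation list.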
Our results support DeepCues as a novel method that improves the representational resolution of DNA sequencing data, and they demonstrate its power in deriving features from raw sequences for cancer type prediction and in discovering new cancer-relevant genes.