Center for Precision Health School of Biomedical Informatics The University of Texas Health Science Center at Houston (UTHealth) Houston TX 77030 USA.
MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences Houston TX 77030 USA.
Adv Sci (Weinh). 2021 Mar 8;8(9):2004958. doi: 10.1002/advs.202004958. eCollection 2021 May.
Approximately 15% of human cancers are estimated to be attributable to viruses. Virus sequences can integrate into the host genome, leading to genomic instability and carcinogenesis. Here, a new deep convolutional neural network (CNN) model with an attention architecture, named DeepVISP, is developed for accurately predicting oncogenic virus integration sites (VISs) in the human genome. Using curated benchmark integration data for three viruses, hepatitis B virus (HBV), human papillomavirus (HPV), and Epstein-Barr virus (EBV), DeepVISP achieves high accuracy and robust performance for all three by automatically learning informative features and essential genomic positions from the DNA sequences alone. DeepVISP outperforms conventional machine learning methods, improving the area under the curve (AUC) by 8.43-34.33% across the three viruses. Moreover, DeepVISP can decode cis-regulatory factors that are potentially involved in virus integration and tumorigenesis, such as HOXB7, IKZF1, and LHX6. These findings are supported by multiple lines of evidence in the literature. Clustering analysis of the informative motifs reveals that the representative k-mers in the clusters could help guide virus recognition of the host genes. A user-friendly web server is developed for predicting putative oncogenic VISs in the human genome using DeepVISP.
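As an illustration of what learning "only from the DNA sequences" typically involves, the sketch below one-hot encodes a nucleotide string into the kind of numeric matrix a CNN can take as input. This is an assumption made for illustration: the abstract does not describe DeepVISP's actual preprocessing, and the helper name `one_hot_encode`, the A/C/G/T channel order, and the all-zero handling of ambiguous bases are hypothetical choices.

```python
# Hypothetical helper; not taken from the DeepVISP code base.
BASES = "ACGT"  # assumed channel order


def one_hot_encode(seq):
    """Return one 4-element row per nucleotide, in A, C, G, T order."""
    rows = []
    for base in seq.upper():
        row = [0.0] * len(BASES)
        idx = BASES.find(base)
        if idx >= 0:  # ambiguous bases such as N are left as all zeros
            row[idx] = 1.0
        rows.append(row)
    return rows


# Encode a short example sequence containing one ambiguous base.
encoded = one_hot_encode("ACGTN")
```

Each row of `encoded` corresponds to one genomic position, so a convolutional filter sliding along the rows acts as a motif detector over the sequence.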