Dasari Chandra Mohan, Bhukya Raju
National Institute of Technology, Warangal, Telangana 506004 India.
Appl Intell (Dordr). 2022;52(3):3002-3017. doi: 10.1007/s10489-021-02572-3. Epub 2021 Jun 25.
Viral infection causes a wide variety of human diseases, including cancer and COVID-19. Viruses invade host cells and associate with host molecules, potentially disrupting normal host function and leading to fatal diseases. Predicting novel viral genomes is crucial for understanding complex viral diseases such as AIDS and Ebola. Most existing computational techniques classify viral genomes, but their classification efficiency depends solely on the structural features extracted. State-of-the-art deep neural network (DNN) models achieve excellent performance by extracting classification features automatically, but their explainability is relatively poor. The proposed CNN-based and CNN-LSTM-based methods (EdeepVPP and EdeepVPP-hybrid) extract features automatically during model training for viral prediction. EdeepVPP also supports model interpretation: it is an interpretable CNN model that uses its learned filters to extract vital, biologically relevant patterns (features) characteristic of viral genomes from the feature maps of viral sequences. The EdeepVPP-hybrid predictor outperforms all existing methods, achieving a mean AUC-ROC of 0.992 and a mean AUC-PR of 0.990 on 19 human metagenomic contig experiment datasets under 10-fold cross-validation. We evaluate the ability of the CNN filters to detect patterns by examining windows with high average activation values. To further assess the robustness of the EdeepVPP model, we perform leave-one-experiment-out cross-validation. The model can serve as a recommendation system for further analysis of raw sequences labeled 'unknown' by alignment-based methods. We show that our interpretable model can extract, through its learned filters, the patterns considered to be the most important features for predicting virus sequences.
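The filter-based interpretation described above can be illustrated with a minimal sketch. It is not the EdeepVPP implementation: the function names, the toy filter weights, the activation threshold, and the example sequences are all hypothetical, and the approach shown (take each sequence's highest-activating window for one filter, then count bases per position) is a common generic way to read motifs out of a convolutional filter, assumed here for illustration only.

```python
# Illustrative sketch only; not the authors' code. All names and values are hypothetical.
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as rows of 4 values (A, C, G, T); unknown bases become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def filter_activations(seq, weights):
    """Slide one convolution filter (k x 4 weights) over the sequence, applying ReLU per window."""
    encoded = one_hot(seq)
    k = len(weights)
    acts = []
    for start in range(len(encoded) - k + 1):
        score = sum(
            weights[i][j] * encoded[start + i][j]
            for i in range(k)
            for j in range(4)
        )
        acts.append(max(0.0, score))
    return acts

def top_windows(seqs, weights, threshold):
    """Return (subsequence, activation) for each sequence whose best window reaches the threshold."""
    k = len(weights)
    hits = []
    for seq in seqs:
        acts = filter_activations(seq, weights)
        if not acts:
            continue  # sequence shorter than the filter
        best = max(range(len(acts)), key=acts.__getitem__)
        if acts[best] >= threshold:
            hits.append((seq[best:best + k], acts[best]))
    return hits

def position_frequency_matrix(windows):
    """Count how often each base occurs at each position across equally long windows."""
    k = len(windows[0])
    pfm = [{b: 0 for b in BASES} for _ in range(k)]
    for window in windows:
        for i, base in enumerate(window.upper()):
            if base in BASES:
                pfm[i][base] += 1
    return pfm

# Hypothetical 3-wide filter that rewards the 3-mer "TAT" (columns are A, C, G, T).
toy_filter = [
    [-1.0, -1.0, -1.0, 1.0],  # position 0 favours T
    [1.0, -1.0, -1.0, -1.0],  # position 1 favours A
    [-1.0, -1.0, -1.0, 1.0],  # position 2 favours T
]
toy_seqs = ["GGTATCC", "ACGTATA", "CCCCCCC"]

hits = top_windows(toy_seqs, toy_filter, threshold=2.5)
pfm = position_frequency_matrix([window for window, _ in hits])
```

In a trained model the weights would come from a learned first-layer filter rather than being written by hand, and the resulting position frequency matrix could then be compared against known sequence motifs.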