Quang Daniel, Chen Yifei, Xie Xiaohui
Department of Computer Science and Center for Complex Biological Systems, University of California, Irvine, CA 92697, USA.
Bioinformatics. 2015 Mar 1;31(5):761-3. doi: 10.1093/bioinformatics/btu703. Epub 2014 Oct 22.
Annotating genetic variants, especially non-coding variants, for the purpose of identifying pathogenic variants remains a challenge. Combined annotation-dependent depletion (CADD) is an algorithm designed to annotate both coding and non-coding variants, and has been shown to outperform other annotation algorithms. CADD trains a linear kernel support vector machine (SVM) to differentiate evolutionarily derived, likely benign, alleles from simulated, likely deleterious, variants. However, a linear SVM cannot capture non-linear relationships among the features, which can limit performance. To address this issue, we have developed DANN. DANN uses the same feature set and training data as CADD to train a deep neural network (DNN). DNNs can capture non-linear relationships among features and are better suited than SVMs for problems with a large number of samples and features. We exploit Compute Unified Device Architecture (CUDA)-compatible graphics processing units and deep learning techniques such as dropout and momentum training to accelerate the DNN training. DANN achieves about a 19% relative reduction in the error rate and about a 14% relative increase in the area under the curve (AUC) metric over CADD's SVM methodology.
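The two training techniques named above, dropout and momentum, can be illustrated with a minimal sketch. It is not the DANN implementation: the single 16-unit tanh hidden layer, the four-point XOR toy data, the hyperparameters and all function names are assumptions chosen for illustration, and the code runs in NumPy on the CPU rather than on a CUDA GPU. XOR is used only because it is a small example of a non-linear relationship that a linear classifier cannot separate.

```python
import numpy as np

def init_params(rng, n_in, n_hidden):
    """One hidden layer (tanh) and a sigmoid output; sizes are illustrative."""
    return {
        "W1": rng.normal(0.0, 0.5, size=(n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.5, size=(n_hidden, 1)),
        "b2": np.zeros(1),
    }

def forward(params, X, drop_rate=0.0, rng=None):
    """Return output probabilities, hidden activations and the dropout mask."""
    h = np.tanh(X @ params["W1"] + params["b1"])
    if drop_rate > 0.0 and rng is not None:
        # Inverted dropout: zero random hidden units and rescale the survivors.
        mask = (rng.random(h.shape) >= drop_rate) / (1.0 - drop_rate)
    else:
        mask = np.ones_like(h)  # no dropout at prediction time
    z = (h * mask) @ params["W2"] + params["b2"]
    return 1.0 / (1.0 + np.exp(-z)), h, mask

def log_loss(params, X, y):
    """Mean binary cross-entropy, evaluated without dropout."""
    p = np.clip(forward(params, X)[0].ravel(), 1e-12, 1.0 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def train(X, y, n_hidden=16, epochs=3000, lr=0.1, momentum=0.9,
          drop_rate=0.1, seed=0):
    """Full-batch gradient descent with classical momentum and dropout."""
    rng = np.random.default_rng(seed)
    params = init_params(rng, X.shape[1], n_hidden)
    velocity = {k: np.zeros_like(v) for k, v in params.items()}
    y_col = y.reshape(-1, 1)
    n = X.shape[0]
    for _ in range(epochs):
        p, h, mask = forward(params, X, drop_rate, rng)
        dz = (p - y_col) / n  # gradient of the mean cross-entropy w.r.t. z
        grads = {"W2": (h * mask).T @ dz, "b2": dz.sum(axis=0)}
        # Backpropagate through the dropout mask and the tanh non-linearity.
        dh = (dz @ params["W2"].T) * mask * (1.0 - h ** 2)
        grads["W1"] = X.T @ dh
        grads["b1"] = dh.sum(axis=0)
        for k in params:
            # Momentum: the velocity accumulates a decaying sum of past gradients.
            velocity[k] = momentum * velocity[k] - lr * grads[k]
            params[k] += velocity[k]
    return params

def predict(params, X):
    """Hard 0/1 labels from the network output, without dropout."""
    return (forward(params, X)[0].ravel() >= 0.5).astype(int)

# Toy data (assumption): XOR of two binary features.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_toy = np.array([0, 1, 1, 0])
model = train(X_toy, y_toy)
print("loss after training:", log_loss(model, X_toy, y_toy))
print("predicted labels:", predict(model, X_toy))
```

In this sketch dropout is active only during training and is switched off for prediction, and the momentum term reuses past gradient directions so that each update depends on more than the current gradient.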
All data and source code are available at https://cbcl.ics.uci.edu/public_data/DANN/.