Center for Applied Genomics, Department of Human Genetics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA.
Perelman School of Medicine, Department of Pediatrics, University of Pennsylvania, Philadelphia, PA 19102, USA.
Brief Bioinform. 2021 Sep 2;22(5). doi: 10.1093/bib/bbaa381.
Copy number variations (CNVs) are an important class of variations contributing to the pathogenesis of many disease phenotypes. Detecting CNVs from genomic data remains difficult, and most currently applied methods suffer from an unacceptably high false positive rate. A common practice is to have human experts manually review the original CNV calls to filter out false positives before downstream analysis or experimental validation. Here, we propose DeepCNV, a deep learning-based tool intended to replace human experts when validating CNV calls, focusing on the calls made by one of the most accurate CNV callers, PennCNV. The deep neural network is supported by over 10 000 expert-scored samples, which are split into training and testing sets. Variant confidence, especially for CNVs, is a main roadblock impeding progress in linking CNVs with disease. We show that DeepCNV adds to the confidence of CNV calls, with an optimal area under the receiver operating characteristic curve (AUC) of 0.909, exceeding other machine learning methods. The superiority of DeepCNV was also benchmarked and confirmed using an experimental wet-lab validation dataset. We conclude that the improvement obtained by DeepCNV results in significantly fewer false positive results and fewer failures to replicate CNV association results.
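As an illustration of the evaluation metric quoted in the abstract, the following is a minimal sketch of how an area under the receiver operating characteristic curve can be computed from classifier scores. It is not DeepCNV's code: the function name `roc_auc`, the labels and the scores are all hypothetical, and the pairwise (Mann-Whitney) formulation is one standard way to obtain the AUC.

```python
def roc_auc(labels, scores):
    """AUC via the Mann-Whitney statistic: the fraction of
    (positive, negative) pairs ranked correctly, ties counting half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Compare every true call against every false call.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = expert-confirmed CNV call, 0 = false positive.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.92, 0.71, 0.65, 0.40, 0.30, 0.10]
auc = roc_auc(labels, scores)
```

In this toy example 8 of the 9 positive-negative pairs are ordered correctly, so `auc` equals 8/9. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of true calls from false positives.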