Department of Computer Science and Technology, School of Electronics and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China.
Institute of Data Science and Information Quality, Shaanxi Engineering Research Center of Medical and Health Big Data, Xi'an Jiaotong University, Xi'an 710049, China.
Brief Bioinform. 2022 Sep 20;23(5). doi: 10.1093/bib/bbac375.
Copy number variations (CNVs) are a class of key biomarkers in many complex traits and diseases. Detecting CNVs from sequencing data is a substantial bioinformatics problem and a standard requirement in clinical practice. Although many CNV detection approaches have been proposed, the core statistical model underlying them is weakened by two critical computational issues: (i) identifying the optimal sliding-window setting and (ii) correcting for bias and noise. We designed a statistical process model that overcomes these limitations by calculating regional read depths with an exponentially weighted moving average strategy. CNVs of various lengths are then detected in a single run by a dynamic sliding window whose size adapts itself to the weighted averages. We also designed a novel bias/noise reduction model, used together with the moving average, which can handle complicated patterns and extend the training data. This model, called PEcnv, accurately detects CNVs ranging from kb scale to chromosome-arm level. Model performance was validated on both simulated and real samples. Comparative analysis showed that PEcnv outperforms currently popular approaches. Notably, PEcnv provided considerable advantages in detecting small CNVs (1 kb to 1 Mb) in panel sequencing data. PEcnv thus fills the gap left by existing methods, which focus on large CNVs, and may have broad applications in clinical testing, where panel sequencing is the dominant strategy. Availability and implementation: Source code is freely available at https://github.com/Sherwin-xjtu/PEcnv.
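To make the exponentially weighted moving average idea concrete, the following is a minimal sketch of smoothing per-bin read depths and flagging bins whose smoothed depth departs from a baseline. It is not the PEcnv implementation: the function names `ewma` and `flag_bins`, the parameters `alpha` and `tolerance`, the fixed `baseline`, and the toy depths are all assumptions made for illustration, and the dynamic window sizing and bias/noise model described above are not reproduced.

```python
from typing import List, Sequence


def ewma(depths: Sequence[float], alpha: float = 0.3) -> List[float]:
    """Exponentially weighted moving average of per-bin read depths.

    alpha (hypothetical parameter) is the weight given to the newest bin;
    the first smoothed value is simply the first observation.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed: List[float] = []
    current = None
    for depth in depths:
        if current is None:
            current = float(depth)
        else:
            current = alpha * depth + (1 - alpha) * current
        smoothed.append(current)
    return smoothed


def flag_bins(depths: Sequence[float], baseline: float,
              alpha: float = 0.3, tolerance: float = 0.25) -> List[int]:
    """Return indices of bins whose smoothed depth differs from the
    baseline by more than `tolerance` (a fraction of the baseline).

    A fixed baseline and threshold are simplifying assumptions.
    """
    limit = tolerance * baseline
    return [i for i, s in enumerate(ewma(depths, alpha))
            if abs(s - baseline) > limit]


if __name__ == "__main__":
    # Toy data: five normal bins, five bins at half depth, five normal bins.
    toy_depths = [100] * 5 + [50] * 5 + [100] * 5
    print(flag_bins(toy_depths, baseline=100, alpha=0.5))
```

On the toy data the flagged run begins one bin after the true drop, because the average needs a step to cross the threshold. This lag, which grows as `alpha` shrinks, is one way to see why the smoothing weight and window size matter for detecting short events.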