Efficient change-points detection for genomic sequences via cumulative segmented regression.

Affiliations

School of Statistics and Mathematics; Interdisciplinary Research Institute of Data Science, Shanghai Lixin University of Accounting and Finance, Shanghai 201209, China.

Statistics and Mathematics School, Yunnan University of Finance and Economics, Kunming 650221, China.

Publication information

Bioinformatics. 2022 Jan 3;38(2):311-317. doi: 10.1093/bioinformatics/btab685.

DOI:10.1093/bioinformatics/btab685

PMID:34601562

Abstract

MOTIVATION

Knowing the number and exact locations of multiple change points in genomic sequences serves several biological needs. The cumulative-segmented algorithm (cumSeg) was recently proposed as a computationally efficient approach to multiple change-point detection; it is based on a simple transformation of the data and gives results that are quite robust to model mis-specification. However, the transformation also accumulates the errors, so the transformed model exhibits heteroscedasticity and serial correlation. As a result, the variability of the estimated change points differs considerably from one change point to another, even though all change-point locations should be equally important in the original genomic sequence.
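
The effect of accumulating errors can be illustrated with a minimal Python sketch. This is not the authors' R implementation. It assumes that the transformation is a plain cumulative sum and that the noise is i.i.d. Gaussian, and the function names `cumulate`, `slope_breaks`, and `cumulated_error_moments` are hypothetical. Under these assumptions, a step in the mean of the original sequence becomes a change of slope in the cumulated sequence, while the cumulated errors have a variance that grows with position and are strongly correlated with their neighbours.

```python
import numpy as np

def cumulate(y):
    """Cumulative-sum transform (assumed): z_i = y_1 + ... + y_i."""
    return np.cumsum(np.asarray(y, dtype=float))

def slope_breaks(z):
    """Indices i where the slope of z changes between positions i and i+1."""
    slopes = np.diff(z, prepend=0.0)  # differences recover the original levels
    return np.flatnonzero(np.diff(slopes) != 0.0)

def cumulated_error_moments(n=200, sigma=1.0, reps=4000, seed=0):
    """Monte Carlo variance of cumulated i.i.d. errors at each position,
    plus the correlation between the last two cumulated errors."""
    rng = np.random.default_rng(seed)
    errors = rng.normal(0.0, sigma, size=(reps, n))
    cum = np.cumsum(errors, axis=1)
    variances = cum.var(axis=0)  # grows roughly like i * sigma**2
    lag1_corr = np.corrcoef(cum[:, -2], cum[:, -1])[0, 1]
    return variances, lag1_corr

# Noiseless piecewise-constant mean with a single change point.
mu = np.repeat([0.0, 1.0], 100)
breaks = slope_breaks(cumulate(mu))

# Heteroscedasticity and serial correlation of the cumulated errors.
variances, lag1_corr = cumulated_error_moments()
```

In this sketch the error variance near the end of the sequence is many times larger than near the start, which is consistent with change points at different positions being estimated with unequal precision.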

RESULTS

In this study, we develop two new change-point detection procedures within the framework of cumulative segmented regression. Simulations show that the proposed methods not only substantially improve the efficiency of each change-point estimator but also yield estimators with similar variability across all change points. We illustrate their performance in detecting copy number variations by applying them to Coriell and SNP genotyping data.

AVAILABILITY AND IMPLEMENTATION

The proposed algorithms are implemented in R, and the code is provided in the online supplementary material.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
