Efficient change-points detection for genomic sequences via cumulative segmented regression.

Author information

School of Statistics and Mathematics; Interdisciplinary Research Institute of Data Science, Shanghai Lixin University of Accounting and Finance, Shanghai 201209, China.

Statistics and Mathematics School, Yunnan University of Finance and Economics, Kunming 650221, China.

Publication information

Bioinformatics. 2022 Jan 3;38(2):311-317. doi: 10.1093/bioinformatics/btab685.

Abstract

MOTIVATION

Knowing the number and exact locations of multiple change points in genomic sequences serves several biological needs. The cumulative-segmented algorithm (cumSeg) was recently proposed as a computationally efficient approach to multiple change-point detection; it is based on a simple transformation of the data and gives results that are quite robust to model mis-specification. However, the errors also accumulate in the transformed model, which introduces heteroscedasticity and serial correlation. As a result, the variability of the estimated change points differs considerably from one change point to another, even though the locations of all change points should be equally important in the original genomic sequence.
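The error accumulation described above can be illustrated with a minimal sketch. This is not the authors' R code or the cumSeg implementation; it assumes the transformation is a plain running sum and that the noise is independent with constant variance. The function names, toy signal and simulation settings are hypothetical. Under those assumptions, a step in the mean becomes a change of slope in the running sums, while the accumulated noise at position i has variance close to i times the original noise variance.

```python
import random
from itertools import accumulate


def cumulative_transform(y):
    """Return the running sums of y (assumed form of the transformation)."""
    return list(accumulate(y))


def accumulated_noise_variance(n, sigma, n_rep, seed=0):
    """Empirical variance of cumulated i.i.d. Gaussian noise at each position."""
    rng = random.Random(seed)
    sums = [0.0] * n
    squares = [0.0] * n
    for _ in range(n_rep):
        noise = [rng.gauss(0.0, sigma) for _ in range(n)]
        for i, value in enumerate(cumulative_transform(noise)):
            sums[i] += value
            squares[i] += value * value
    return [squares[i] / n_rep - (sums[i] / n_rep) ** 2 for i in range(n)]


# Hypothetical noise-free signal: mean 0 for five points, then mean 2.
signal = [0.0] * 5 + [2.0] * 5
z = cumulative_transform(signal)

# Successive differences of z: the step in the mean shows up as a slope change.
slopes = [b - a for a, b in zip(z, z[1:])]

# Variance of the accumulated noise grows roughly linearly with position.
variances = accumulated_noise_variance(n=50, sigma=1.0, n_rep=4000)

if __name__ == "__main__":
    print("running sums:", z)
    print("slopes:", slopes)
    print("variance at first and last position:", variances[0], variances[-1])
```

In this toy setting the unequal variances along the sequence are the heteroscedasticity mentioned in the abstract, and because neighbouring running sums share most of their noise terms they are also serially correlated.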

RESULTS

In this study, we develop two new change-point detection procedures in the framework of cumulative segmented regression. Simulations show that the proposed methods not only substantially improve the efficiency of each change-point estimator but also yield estimators with similar variability across all change points. By applying the proposed algorithms to Coriell and SNP genotyping data, we illustrate their performance in detecting copy number variations.

AVAILABILITY AND IMPLEMENTATION

The proposed algorithms are implemented in R, and the code is provided in the online supplementary material.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
