RegScaf: a regression approach to scaffolding.

Affiliations

National Center of Mathematics and Interdisciplinary Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China.

University of Chinese Academy of Sciences, Beijing 100049, China.

Publication information

Bioinformatics. 2022 May 13;38(10):2675-2682. doi: 10.1093/bioinformatics/btac174.

DOI:10.1093/bioinformatics/btac174

PMID:35561180

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9326850/

Abstract

MOTIVATION

The correctness of a genome assembly depends critically on the accuracy of the underlying scaffolds, which specify the order and orientation of contigs together with the gap distances between them. Current methods construct scaffolds from the alignments of 'linking' reads against contigs. We found that some 'optimal' alignments are wrong because of factors such as the contig boundary effect, particularly in the presence of repeats. Occasionally, the incorrect alignments can even overwhelm the correct ones. Detecting incorrect linking information is challenging for existing methods.

RESULTS

In this study, we present RegScaf, a novel scaffolding method. It first examines the distribution of the distances between contigs implied by read alignments, using a kernel density estimate. When a density shows multiple modes, orientation-supported links are grouped into clusters, each of which defines a linking distance corresponding to one mode. A linear model then parameterizes contigs by their positions on the genome, so that each linking distance between a pair of contigs is treated as an observation of the difference between their positions. The parameters are estimated by minimizing a global loss function, a version of the trimmed sum of squares. The least trimmed squares estimate has a breakdown value high enough to remove mistaken linking distances automatically. Results on both synthetic and real datasets demonstrate that RegScaf outperforms some popular scaffolders, especially in the accuracy of gap estimates, where it substantially reduces extreme errors. Its strength in resolving repeat regions is exemplified by a real case, and its adaptability to large genomes and third-generation sequencing (TGS) long reads is validated as well.
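
The regression idea can be illustrated with a minimal Python sketch. It is not the RegScaf implementation: the function name `lts_contig_positions`, its parameters and the toy link distances are hypothetical, and the random-restart "concentration step" heuristic is a generic way to approximate a least trimmed squares fit, assumed here only for illustration. Each link (i, j, d) is treated as an observation that position[j] - position[i] is about d, and only the `n_keep` best-fitting links enter the sum of squares.

```python
import numpy as np


def lts_contig_positions(links, n_contigs, n_keep, n_starts=200,
                         n_csteps=20, seed=0):
    """Approximate least trimmed squares fit of contig positions.

    links: iterable of (i, j, d), read as position[j] - position[i] ~ d.
    Contig 0 is pinned at position 0 so the model is identifiable.
    Returns (positions, kept); kept[r] is False for trimmed links.
    Hypothetical helper for illustration, not part of RegScaf.
    """
    links = list(links)
    m = len(links)
    # One row per link: -1 in the column of contig i, +1 for contig j.
    X = np.zeros((m, n_contigs))
    y = np.empty(m)
    for r, (i, j, d) in enumerate(links):
        X[r, i], X[r, j], y[r] = -1.0, 1.0, d
    X = X[:, 1:]  # drop contig 0's column: its position is fixed at 0

    rng = np.random.default_rng(seed)
    best_loss, best_beta, best_subset = np.inf, None, None
    for _ in range(n_starts):
        # Start from a random subset of n_keep links.
        subset = np.sort(rng.choice(m, size=n_keep, replace=False))
        for _ in range(n_csteps):
            # Concentration step: ordinary least squares on the subset,
            # then keep the n_keep links with the smallest residuals.
            beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
            resid2 = (y - X @ beta) ** 2
            new_subset = np.sort(np.argsort(resid2)[:n_keep])
            if np.array_equal(new_subset, subset):
                break
            subset = new_subset
        loss = resid2[subset].sum()  # trimmed sum of squares at beta
        if loss < best_loss:
            best_loss, best_beta, best_subset = loss, beta, subset

    kept = np.zeros(m, dtype=bool)
    kept[best_subset] = True
    return np.concatenate(([0.0], best_beta)), kept


# Toy data (made up): four contigs near 0, 1000, 2500 and 4000,
# eight consistent links and two mistaken ones at the end.
links = [
    (0, 1, 1010), (0, 1, 990), (1, 2, 1495), (1, 2, 1505),
    (2, 3, 1500), (2, 3, 1490), (0, 2, 2510), (1, 3, 3000),
    (0, 1, 3000), (2, 3, 200),
]
positions, kept = lts_contig_positions(links, n_contigs=4, n_keep=8)
print(np.round(positions), kept)
```

In this toy setting the two mistaken links conflict with several consistent ones, so they end up outside the kept subset and do not distort the estimated positions. An ordinary least squares fit over all ten links would instead spread their error across every contig.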

AVAILABILITY AND IMPLEMENTATION

RegScaf is publicly available at https://github.com/lemontealala/RegScaf.git.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
