Systems and Computing Engineering Department, Universidad de Los Andes, Bogotá 111711, Colombia.
Gigascience. 2024 Jan 2;13. doi: 10.1093/gigascience/giad112.
Structural variants (SVs) are genomic polymorphisms longer than 50 bp. The usual types of SVs are deletions, insertions, translocations, inversions, and copy number variants. SV detection and genotyping are fundamental given the role of SVs in phenomena such as phenotypic variation and evolutionary events. Thus, methods to identify SVs from long-read sequencing data have recently been developed.
We present an accurate and efficient algorithm to predict germline SVs from long-read sequencing data. The algorithm starts by collecting evidence (signatures) of SVs from read alignments. Then, signatures are clustered based on a Euclidean graph with coordinates calculated from lengths and genomic positions. Clustering is performed by the DBSCAN algorithm, which provides the advantage of delimiting clusters with high resolution. Clusters are transformed into SVs, and a Bayesian model precisely genotypes each SV based on its supporting evidence. This algorithm is integrated into the single-sample variant detector of the Next Generation Sequencing Experience Platform, which facilitates integration with other functionalities for genomics analysis. We performed multiple benchmark experiments with simulated and real data, representing different genome profiles, sequencing technologies (PacBio HiFi, ONT), and read depths.
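The clustering and genotyping steps can be illustrated with a minimal Python sketch. It is not the published implementation: the use of raw (position, length) pairs as coordinates, the `eps` and `min_pts` values, the binomial likelihood, the 5% error rate, and the uniform prior are all assumptions chosen for illustration, and the function names `dbscan` and `genotype` are hypothetical.

```python
import math

def dbscan(points, eps, min_pts):
    """Plain DBSCAN on 2-D points; returns one label per point (-1 = noise)."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1            # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # noise reached from a core point: border point
            if labels[j] is not None:
                continue              # already assigned, do not expand again
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels

def genotype(alt_reads, total_reads, error=0.05, priors=(1 / 3, 1 / 3, 1 / 3)):
    """Posterior over 0/0, 0/1, 1/1 with an assumed binomial likelihood."""
    alt_fraction = (error, 0.5, 1 - error)  # expected alt-read fraction per genotype
    joint = [
        prior * math.comb(total_reads, alt_reads)
        * p ** alt_reads * (1 - p) ** (total_reads - alt_reads)
        for p, prior in zip(alt_fraction, priors)
    ]
    total = sum(joint)
    posterior = [x / total for x in joint]
    best = max(range(3), key=posterior.__getitem__)
    return ("0/0", "0/1", "1/1")[best], posterior[best]

# Hypothetical signatures: (genomic position, SV length) from individual reads.
signatures = [
    (1000, 300), (1005, 310), (998, 295), (1002, 305),  # one ~300 bp event
    (5000, 80), (5003, 82), (5001, 79),                 # one ~80 bp event
    (9000, 500),                                        # isolated signature
]
labels = dbscan(signatures, eps=25, min_pts=3)
call, confidence = genotype(alt_reads=4, total_reads=8)
```

In this toy input the isolated signature is labeled as noise, which shows how a density-based method can discard unsupported evidence without fixing the number of clusters in advance.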
The results show that our approach outperformed state-of-the-art tools on germline SV calling and genotyping, especially at low depths, and in error-prone repetitive regions. We believe this work significantly contributes to the development of bioinformatic strategies to maximize the use of long-read sequencing technologies.