Department of Information Engineering, University of Florence, 50100, Florence, Italy.
Institute for Biomedical Technologies, National Research Council, Segrate, Milan, Italy.
Sci Rep. 2023 Nov 27;13(1):20817. doi: 10.1038/s41598-023-48285-0.
Long-read sequencing allows analysis of single nucleic-acid molecules and produces sequences on the order of tens to hundreds of kilobases. Its application to whole-genome analyses allows identification of complex genomic structural variants (SVs) with unprecedented resolution. SV identification, however, requires complex computational methods, based on either read-depth or intra- and inter-alignment signature approaches, which are limited by the size or type of SVs they can detect. Moreover, most currently available tools detect only germline variants, so comparative analyses require each sample of a pair to be processed separately. To overcome these limits, we developed a novel tool (Germline And SOmatic structuraL varIants detectioN and gEnotyping; GASOLINE) that groups SV signatures using a sophisticated clustering procedure based on a modified reciprocal overlap criterion, and is designed to identify germline SVs from single samples and somatic SVs from paired test and control samples. GASOLINE is a collection of Perl, R and Fortran code; it analyzes aligned data in BAM format and produces VCF files with statistically significant somatic SVs. Germline or somatic analysis of 30× sequencing coverage experiments requires 4-5 h with 20 threads. GASOLINE outperformed currently available methods in the detection of both germline and somatic SVs in synthetic and real long-read datasets. Notably, when applied to a paired metastatic melanoma and matched-normal sample, GASOLINE identified five genuine somatic SVs that were missed using five different sequencing technologies and state-of-the-art SV calling approaches. Thus, GASOLINE identifies germline and somatic SVs with unprecedented accuracy and resolution, outperforming currently available state-of-the-art computational methods for long-read whole-genome sequencing (WGS).
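The abstract does not specify how GASOLINE modifies the reciprocal overlap criterion, so the following Python sketch illustrates only the standard, unmodified criterion and a simple greedy grouping of SV signatures. The function names, the half-open interval convention, the 0.5 threshold, and the greedy strategy are all assumptions made for illustration; none of them is taken from GASOLINE.

```python
from typing import List, Tuple

# An SV signature is represented here as a half-open interval [start, end)
# on a single chromosome. This representation is an assumption.
Interval = Tuple[int, int]


def reciprocal_overlap(a: Interval, b: Interval) -> float:
    """Return the smaller of the two overlap fractions (overlap/len(a), overlap/len(b))."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0  # disjoint, touching, or zero-length intervals
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))


def cluster_signatures(
    signatures: List[Interval], min_ro: float = 0.5
) -> List[List[Interval]]:
    """Greedy grouping: a signature joins the first cluster whose first member
    it matches with reciprocal overlap >= min_ro; otherwise it starts a new cluster."""
    clusters: List[List[Interval]] = []
    for sig in sorted(signatures):
        for cluster in clusters:
            if reciprocal_overlap(cluster[0], sig) >= min_ro:
                cluster.append(sig)
                break
        else:
            clusters.append([sig])
    return clusters


if __name__ == "__main__":
    sigs = [(100, 200), (110, 210), (1000, 1500)]
    print(cluster_signatures(sigs))
```

Taking the minimum of the two fractions makes the criterion symmetric: a small signature nested inside a much larger one scores low, so two signatures are grouped only when each covers a substantial share of the other.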