Exploration of whole genome amplification generated chimeric sequences in long-read sequencing data.

Affiliations

State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China.

Monash University-Southeast University Joint Research Institute, Suzhou 215123, China.

Publication information

Brief Bioinform. 2023 Sep 20;24(5). doi: 10.1093/bib/bbad275.

DOI:10.1093/bib/bbad275

PMID:37529913

Abstract

MOTIVATION

Multiple displacement amplification (MDA) has become the most commonly used method of whole genome amplification, generating large amounts of DNA with higher molecular weight and greater genome coverage. Coupled with long-read sequencing, it makes it possible to sequence amplicons over 20 kb in length. However, the formation of chimeric sequences (chimeras, which appear as structural errors in sequencing data) during MDA seriously interferes with bioinformatics analysis, and its influence on long-read sequencing data is unknown.

RESULTS

We sequenced phi29 DNA polymerase-mediated MDA amplicons on the PacBio platform and analyzed the chimeras within the generated data. The 3rd-ChimeraMiner was constructed as a pipeline for recognizing chimeras in long-read sequencing data and restoring them to their original structures, improving the efficiency of using third-generation sequencing (TGS) data. Five long-read datasets and one high-fidelity long-read dataset with various amplification folds were analyzed. The results reveal that mis-priming events in amplification occur more frequently than widely perceived, and the proportion gradually accumulates from 42% to over 78% as amplification continues. In total, 99.92% of recognized chimeric sequences were demonstrated to be artifacts, whose structures were wrongly formed in MDA rather than existing in the original genomes. By restoring chimeras to their original structures, the vast majority of supplementary alignments that introduce false-positive structural variants are recycled, removing 97% of inversions on average and contributing to the analysis of structural variation in MDA-amplified samples. The impact of chimeras on long-read sequencing data analysis should be emphasized, and the 3rd-ChimeraMiner can help to quantify and reduce the influence of chimeras.
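
To illustrate how supplementary alignments can expose a chimeric read, the following is a minimal sketch, not the 3rd-ChimeraMiner implementation. It assumes reads were aligned to a reference and written as SAM records carrying the standard `SA:Z` tag, and it flags a read as a candidate inverted chimera when a supplementary segment maps to the same chromosome on the opposite strand within a short distance of the primary segment. The function names and the `max_distance` threshold of 5000 bp are hypothetical choices made for this example.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    chrom: str
    pos: int      # 1-based leftmost mapping position
    strand: str   # "+" or "-"


def parse_sa_tag(sa_value: str) -> List[Segment]:
    """Parse the value of a SAM SA:Z tag: rname,pos,strand,CIGAR,mapQ,NM;..."""
    segments = []
    for entry in sa_value.strip(";").split(";"):
        if not entry:
            continue
        rname, pos, strand, _cigar, _mapq, _nm = entry.split(",")
        segments.append(Segment(rname, int(pos), strand))
    return segments


def primary_segment(sam_line: str) -> Segment:
    """Chromosome, position and strand of the alignment on this SAM line."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    strand = "-" if flag & 0x10 else "+"  # FLAG bit 0x10: reverse strand
    return Segment(fields[2], int(fields[3]), strand)


def supplementary_segments(sam_line: str) -> List[Segment]:
    """Segments listed in the SA:Z optional field, or an empty list."""
    fields = sam_line.rstrip("\n").split("\t")
    for field in fields[11:]:  # optional fields start at column 12
        if field.startswith("SA:Z:"):
            return parse_sa_tag(field[len("SA:Z:"):])
    return []


def is_inverted_chimera_candidate(sam_line: str, max_distance: int = 5000) -> bool:
    """True if a supplementary segment lies nearby on the opposite strand.

    max_distance is a hypothetical threshold chosen for this sketch.
    """
    primary = primary_segment(sam_line)
    for seg in supplementary_segments(sam_line):
        if (seg.chrom == primary.chrom
                and seg.strand != primary.strand
                and abs(seg.pos - primary.pos) <= max_distance):
            return True
    return False


if __name__ == "__main__":
    # Made-up record: primary on chr1 "+" strand, supplementary on chr1 "-" strand.
    record = "\t".join([
        "read1", "0", "chr1", "10000", "60", "500M300S",
        "*", "0", "0", "*", "*", "SA:Z:chr1,10200,-,300M500S,60,2;",
    ])
    print(is_inverted_chimera_candidate(record))
```

A real pipeline would also need to compare the query coordinates of each segment, which requires parsing the CIGAR strings; this sketch only checks reference position and strand.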

AVAILABILITY AND IMPLEMENTATION

The 3rd-ChimeraMiner is available on GitHub, https://github.com/dulunar/3rdChimeraMiner.
