Konanov Dmitry N, Tereshchuk Vera Y, Sonets Ignat V, Korneenko Elena V, Lukina-Gronskaya Aleksandra V, Speranskaya Anna S, Ilina Elena N
Research Institute for System Biology and Medicine, Moscow 117246, Russia.
Phystech School of Biological and Medical Physics of MIPT, Moscow Institute of Physics and Technology, Dolgoprudny 141701, Russia.
Biology (Basel). 2025 Jun 9;14(6):670. doi: 10.3390/biology14060670.
DNA nanoball sequencing (DNBSEQ) is one of the most rapidly developing sequencing technologies and is widely applied in genomic and transcriptomic investigations. Recently, a new PE300 sequencing option, primarily recommended for amplicon analysis, was released for DNBSEQ-G99 and G400 devices. Given their unprecedentedly high data yield per flow cell, the new PE300 kits could be a great choice for various sequencing tasks, but we found that combining different types of DNA libraries in a single run can introduce undesired artifacts into the data. In this study, we investigate the occasional read cross-contamination that we first observed in our DNBSEQ PE300 run. The phenomenon, which we refer to as "software contamination", is not actual contamination; it manifests primarily as improper forward/reverse read pairing, improper demultiplexing, or "digital chimeric" reads. Although rare, these artifacts were found in all runs we analyzed, including several MGI demo datasets (both PE100 and PE150). We demonstrate that the artifacts arise primarily from the incorrect resolution of sequencing signals produced by neighboring DNA nanoballs, which leads to mixing up of forward and reverse reads or to improper demultiplexing. The artifacts occur most frequently in read pairs whose insert sequence is shorter than the read length. Based on a few external NA12878 human exome sequencing datasets, we conclude that the total improper pairing rate in DNBSEQ data is comparable to that in Illumina data. Overall, the problem affects analysis results only when simultaneously sequenced libraries have markedly different insert size distributions or flow cell loading. Additionally, we demonstrate that raw DNBSEQ data might contain ~2% optical duplicates, which result from the same effect: the close proximity of neighboring DNB sites in the flow cell.