Laboratory of Human Retrovirology and Immunoinformatics, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA.
Laboratory of Immunoregulation, National Institute of Allergy and Infectious Diseases, Bethesda, MD 20892, USA.
Bioinformatics. 2022 Jun 13;38(12):3192-3199. doi: 10.1093/bioinformatics/btac313.
The existence of quasispecies in a viral population complicates disease prevention and treatment. High-throughput sequencing provides an opportunity to detect rare quasispecies, and long sequencing reads that cover full genomes reduce quasispecies determination to a clustering problem. The challenge is the high similarity among quasispecies and the high error rate of long sequencing reads.
We developed QuasiSeq, which uses a novel signature-based self-tuning clustering method, SigClust, to profile viral mixtures with high accuracy and sensitivity. QuasiSeq can correctly identify quasispecies even from low-quality sequencing reads (accuracy <80%) and produce quasispecies sequences with high accuracy (≥99.55%). With high-quality circular consensus sequencing reads, QuasiSeq can produce quasispecies sequences with 100% accuracy. QuasiSeq has higher sensitivity and specificity than similar published software. Moreover, computational resource requirements can be controlled through the size of the signature, which makes it possible to handle large sequencing datasets for rare quasispecies discovery. Furthermore, clusters are processed in parallel to further reduce the runtime. Finally, we developed a web interface for the QuasiSeq workflow with simple parameter settings based on the quality of the sequencing data, making it easy to use for users without advanced data science skills.
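The abstract does not describe how SigClust works internally, so the following is only a minimal sketch of the general idea of clustering reads by a signature, not the QuasiSeq implementation. It assumes the reads are already aligned to equal length, takes the signature to be the most variable alignment columns, and groups reads that share the same bases at those columns. The function names, the `size` and `min_reads` parameters, and the toy reads are all hypothetical.

```python
from collections import Counter, defaultdict


def signature_positions(reads, size):
    """Return the `size` alignment columns with the highest minor-allele frequency.

    Assumption: `reads` are pre-aligned strings of equal length.
    """
    n = len(reads)
    scored = []
    for col in range(len(reads[0])):
        counts = Counter(read[col] for read in reads)
        major = counts.most_common(1)[0][1]
        scored.append((1 - major / n, col))
    # Most variable columns first; ties are broken by column index.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return sorted(col for _, col in scored[:size])


def cluster_by_signature(reads, positions, min_reads=2):
    """Group reads by their bases at the signature positions.

    Groups with fewer than `min_reads` reads are dropped as likely noise.
    """
    clusters = defaultdict(list)
    for read in reads:
        clusters["".join(read[p] for p in positions)].append(read)
    return {sig: members for sig, members in clusters.items()
            if len(members) >= min_reads}


def consensus(reads):
    """Majority-vote consensus sequence for one cluster of aligned reads."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))


# Toy data: two variants that differ at columns 2 and 6, plus two read errors.
demo_reads = [
    "ACGTACGT", "ACGTACGT", "ACGTTCGT", "ACGTACGT",
    "ACCTACAT", "ACCTACAT", "TCCTACAT",
]
demo_positions = signature_positions(demo_reads, size=2)
for sig, members in cluster_by_signature(demo_reads, demo_positions).items():
    print(sig, len(members), consensus(members))
```

In this sketch a smaller `size` means fewer columns are compared per read, which is one way a signature size could trade resolution for memory and runtime. Exact matching on the signature is a simplification that would fragment clusters at realistic long-read error rates.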
QuasiSeq is open source and freely available at https://github.com/LHRI-Bioinformatics/QuasiSeq. The current release (v1.0.0) is archived and available at https://zenodo.org/badge/latestdoi/340494542.
Supplementary data are available at Bioinformatics online.