Institute of Information Science, Academia Sinica, Taipei, Taiwan.
Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan.
BMC Genomics. 2019 Apr 18;19(Suppl 9):238. doi: 10.1186/s12864-019-5445-3.
With the rapid increase in genome sequencing projects for non-model organisms, numerous genome assemblies are in progress or available only as drafts rather than as satisfactory, usable genomes. Quality assessment of genome assemblies is gaining importance not only for those who perform assembly or re-assembly, but also for those who use assemblies as maps in downstream analyses. Recent studies on quality control and quality assessment of genome assemblies have focused either on quality control of reads before assembly or on evaluation of assemblies with respect to their contiguity and correctness. However, correctness assessment depends on a reference and is therefore not applicable to de novo assembly projects. Hence, methods that provide both pre-assembly and post-assembly quality assessment reports, for examining the quality and correctness of de novo assemblies and of the input reads, are worth developing.
We present SQUAT, an efficient tool for both pre-assembly and post-assembly quality assessment of de novo genome assemblies. The pre-assembly module of SQUAT computes quality statistics of reads and presents them in a portable HTML report that visualizes the distribution of high- and poor-quality reads. The post-assembly module of SQUAT provides read mapping analytics, also in HTML format. We categorize reads into several groups: uniquely mapped reads, multiply mapped reads, and unmapped reads. Uniquely mapped reads are further categorized as perfectly matched, with substitutions, containing clips, or others. We carefully define poorly mapped (PM) reads as a set of several groups to prevent underestimation of unmapped reads; a high PM% is a sign of a poor assembly that requires the researchers' attention for further examination or improvement before the assembly is used. Finally, we evaluate SQUAT on six datasets: the genome assemblies of eel, worm, mushroom, and three bacteria. The results show that SQUAT reports provide useful, detailed information for assessing the quality of assemblies and reads.
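The read categories described above can be illustrated with a minimal sketch. It is not SQUAT's implementation: the function names `classify_sam_line` and `summarize`, the rule that a mapping quality of 0 or an `XA` tag marks a multiply mapped read, and the choice of which categories count toward PM% are all assumptions made for illustration. The input is plain SAM text, of the kind an aligner produces when the reads are mapped back to the assembly.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# ASSUMPTION: which categories count as "poorly mapped" is illustrative only.
DEFAULT_PM = frozenset({"unmapped", "unique_clipped", "unique_other"})


def classify_sam_line(line):
    """Assign one SAM alignment line to a mapping category (hypothetical rules)."""
    fields = line.rstrip("\n").split("\t")
    flag, mapq, cigar = int(fields[1]), int(fields[4]), fields[5]
    # Optional tags look like "NM:i:2": key in chars 0-1, value from char 5 on.
    tags = {f[:2]: f[5:] for f in fields[11:]}

    if flag & 0x4 or cigar == "*":
        return "unmapped"
    # ASSUMPTION: MAPQ 0 or an XA (alternative hits) tag means multiply mapped.
    if mapq == 0 or "XA" in tags:
        return "multiply_mapped"

    ops = {op for _, op in CIGAR_OP.findall(cigar)}
    if ops & {"S", "H"}:
        return "unique_clipped"
    if ops <= {"M", "=", "X"}:
        # NM is the edit distance; a missing NM tag is treated as 0 here.
        if "X" in ops or int(tags.get("NM", "0")) > 0:
            return "unique_substitutions"
        return "unique_perfect"
    return "unique_other"  # e.g. alignments containing insertions or deletions


def summarize(lines, pm_categories=DEFAULT_PM):
    """Count reads per category and return (counts, PM%) for primary alignments."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        if int(line.split("\t")[1]) & 0x900:
            continue  # skip secondary and supplementary alignments
        counts[classify_sam_line(line)] += 1
    total = sum(counts.values())
    poorly_mapped = sum(counts[c] for c in pm_categories)
    pm_percent = 100.0 * poorly_mapped / total if total else 0.0
    return counts, pm_percent
```

Calling `summarize(open("reads.sam"))` on a SAM file (the file name is a placeholder) returns the per-category counts together with the PM% under the assumed definition. Passing a different `pm_categories` set changes which groups are treated as poorly mapped.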
The SQUAT software, with links to both its docker image and the on-line manual, is freely available at https://github.com/luke831215/SQUAT.