Nishimura Osamu, Hara Yuichiro, Kuraku Shigehiro
Laboratory for Phyloinformatics, RIKEN Center for Biosystems Dynamics Research (BDR), Kobe, Japan.
Methods Mol Biol. 2019;1962:247-256. doi: 10.1007/978-1-4939-9173-0_15.
In the daily practice of de novo genome assembly and gene prediction, it is natural to want to evaluate the resulting products. Different programs and parameter settings yield different outputs, leaving the researcher to decide which output to adopt for downstream analyses that address biological questions. Rather than relying on superficial length-based statistics of the output sequences (e.g., N50 scaffold length), researchers have increasingly assessed completeness by scoring the coverage of reference orthologs. We previously launched a web service, gVolante (https://gvolante.riken.jp/), to provide a user-friendly interface and a uniform environment for completeness assessment with the pipelines CEGMA and BUSCO. Completeness assessments performed on gVolante report scores based not only on the coverage of reference genes but also on sequence lengths, allowing quality control from multiple aspects. This chapter focuses on the procedure for such assessments and provides technical tips for higher accuracy.
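To make the contrast between length-based and ortholog-based measures concrete, the following minimal Python sketch computes N50 from a list of sequence lengths and, separately, a toy completeness ratio. The function names `n50` and `ortholog_coverage` and the gene identifiers are hypothetical, chosen for illustration only. The ratio is a simplified stand-in and is not the scoring that CEGMA, BUSCO, or gVolante actually performs.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total number of bases.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downwards, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def ortholog_coverage(detected, reference):
    """Toy completeness ratio: fraction of reference genes detected.

    Assumption: both arguments are collections of gene identifiers.
    This is an illustration, not the CEGMA or BUSCO scoring scheme.
    """
    reference = set(reference)
    if not reference:
        raise ValueError("the reference gene set must not be empty")
    return len(reference & set(detected)) / len(reference)


if __name__ == "__main__":
    scaffolds = [80, 70, 50, 40, 30, 20]          # hypothetical lengths
    reference_genes = {"g1", "g2", "g3", "g4"}    # hypothetical gene IDs
    detected_genes = {"g1", "g2", "g4"}
    print("N50:", n50(scaffolds))
    print("coverage:", ortholog_coverage(detected_genes, reference_genes))
```

Two assemblies can share the same N50 yet differ in how many reference genes they recover, which is why the two kinds of measure are reported side by side.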