Institute of Evolutionary Biology, University of Edinburgh, Edinburgh EH9 3JT, UK.
Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California, USA.
GigaScience. 2018 Apr 1;7(4). doi: 10.1093/gigascience/giy034.
Genome assembly and annotation remain exacting tasks. As the tools available for these tasks improve, it is useful to return to data produced with earlier techniques to assess their credibility and correctness. The entomopathogenic nematode Heterorhabditis bacteriophora is widely used to control insect pests in horticulture. The genome sequence for this species was reported to encode an unusually high proportion of unique proteins and a paucity of secreted proteins compared to other related nematodes.
We revisited the H. bacteriophora genome assembly and gene predictions to determine whether these unusual characteristics were biological or methodological in origin. We mapped an independent resequencing dataset to the genome and used the blobtools pipeline to identify potential contaminants. Assembly contamination was present but not significant (0.2% of the genome span, 0.4% of predicted proteins).
Re-prediction of the gene set using BRAKER1 and published transcriptome data generated a predicted proteome that was very different from the published one. The new gene set had a much reduced complement of unique proteins, better completeness values that were in line with other related species' genomes, and an increased number of proteins predicted to be secreted. It is thus likely that methodological issues drove the apparent uniqueness of the initial H. bacteriophora genome annotation and that similar contamination and misannotation issues affect other published genome assemblies.