Novel and improved Caenorhabditis briggsae gene models generated by community curation.

Author information

Moya Nicolas D, Stevens Lewis, Miller Isabella R, Sokol Chloe E, Galindo Joseph L, Bardas Alexandra D, Koh Edward S H, Rozenich Justine, Yeo Cassia, Xu Maryanne, Andersen Erik C

Affiliations

Department of Molecular Biosciences, Northwestern University, Evanston, IL 60208, USA.

Interdisciplinary Biological Sciences Program, Northwestern University, Evanston, IL 60208, USA.

Publication information

bioRxiv. 2023 May 18:2023.05.16.541014. doi: 10.1101/2023.05.16.541014.

Abstract

BACKGROUND

The nematode Caenorhabditis briggsae has been used as a model for genomics studies in comparison with Caenorhabditis elegans because of the striking morphological and behavioral similarities between the two species. These studies yielded numerous findings that have expanded our understanding of nematode development and evolution. However, the potential of C. briggsae to study nematode biology is limited by the quality of its genome resources. The reference genome and gene models for the C. briggsae laboratory strain AF16 have not been developed to the same extent as those of C. elegans. The recent publication of a new chromosome-level reference genome for QX1410, a C. briggsae wild strain closely related to AF16, has provided the first step to bridge the gap between C. elegans and C. briggsae genome resources. Currently, the QX1410 gene models consist of protein-coding gene predictions generated from short- and long-read transcriptomic data. Because of the limitations of gene prediction software, the existing gene models for QX1410 contain numerous errors in their structure and coding sequences. In this study, a team of researchers manually inspected over 21,000 software-derived gene models and underlying transcriptomic data to improve the protein-coding gene models of the C. briggsae QX1410 genome.

RESULTS

We designed a detailed workflow to train a team of nine students to manually curate genes using RNA read alignments and predicted gene models. We manually inspected the gene models using the genome annotation editor, Apollo, and proposed corrections to the coding sequences of over 8,000 genes. Additionally, we modeled thousands of putative isoforms and untranslated regions. We exploited the conservation of protein sequence length between C. briggsae and C. elegans to quantify the improvement in protein-coding gene model quality before and after curation. Manual curation led to a substantial improvement in the protein sequence length accuracy of QX1410 genes. We also compared the curated QX1410 gene models against the existing AF16 gene models. The manual curation efforts yielded QX1410 gene models that are similar in quality to the extensively curated AF16 gene models in terms of protein-length accuracy and biological completeness scores. Collinear alignment analysis between the QX1410 and AF16 genomes revealed over 1,800 genes affected by spurious duplications and inversions in the AF16 genome that are now resolved in the QX1410 genome.
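The protein-length comparison described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `length_accuracy`, the 5% tolerance, and the example lengths are all hypothetical, chosen only to show how agreement in protein length between a gene model and its ortholog in a well-annotated reference species might be summarized before and after curation.

```python
def length_accuracy(pairs, tolerance=0.05):
    """Return the fraction of ortholog pairs whose protein lengths agree.

    pairs: iterable of (query_length, reference_length) in amino acids,
           one tuple per one-to-one ortholog pair.
    tolerance: maximum relative length difference counted as a match
               (0.05 is an assumed cutoff, not a value from the abstract).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no ortholog pairs given")
    matches = 0
    for query_len, ref_len in pairs:
        if ref_len <= 0:
            raise ValueError("reference length must be positive")
        # Relative difference, measured against the reference protein.
        if abs(query_len - ref_len) / ref_len <= tolerance:
            matches += 1
    return matches / len(pairs)


# Made-up lengths for four genes: (gene model length, ortholog length).
before_curation = [(300, 400), (512, 510), (150, 150), (980, 700)]
after_curation = [(398, 400), (512, 510), (150, 150), (705, 700)]

print(length_accuracy(before_curation))
print(length_accuracy(after_curation))
```

In this toy data, a truncated model (300 vs. 400 residues) and a fused model (980 vs. 700 residues) fail the check before curation and pass once corrected, which is the kind of shift such a metric is meant to capture.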

CONCLUSIONS

Community-based, manual curation using transcriptome data is an effective approach to improve the quality of software-derived protein-coding genes. Comparative genomic analysis using a related species with high-quality reference genome(s) and gene models can be used to quantify improvements in gene model quality in a newly sequenced genome. The detailed protocols provided in this work can be useful for future large-scale manual curation projects in other species. The chromosome-level reference genome for the C. briggsae strain QX1410 far surpasses the quality of the genome of the laboratory strain AF16, and our manual curation efforts have brought the QX1410 gene models to a level of quality comparable to that of the previous reference, AF16. The improved genome resources for C. briggsae provide reliable tools for the study of Caenorhabditis biology and other related nematodes.


