Cao Xiaolong, Jiang Haobo
Department of Entomology and Plant Pathology, Oklahoma State University, Stillwater, OK 74078, USA.
Insect Biochem Mol Biol. 2015 Jul;62:2-10. doi: 10.1016/j.ibmb.2015.01.007. Epub 2015 Jan 20.
The genome sequence of Manduca sexta was recently determined using 454 technology. Cufflinks and MAKER2 were used to establish gene models in the genome assembly based on the RNA-Seq data and other species' sequences. Aided by the extensive RNA-Seq data from 50 tissue samples at various life stages, annotators around the world (including the present authors) have manually confirmed and improved a small percentage of the models over months of effort. While such collaborative efforts are highly commendable, many of the predicted genes still have problems that may hamper future research on this insect species. As a biochemical model representing lepidopteran pests, M. sexta has been used extensively to study insect physiological processes for over five decades. In this work, we assembled the Manduca datasets Cufflinks 3.0, Trinity 4.0, and Oases 4.0 to assist the manual annotation efforts and the development of Official Gene Set (OGS) 2.0. To further improve annotation quality, we developed methods to evaluate gene models in the MAKER2, Cufflinks, Oases and Trinity assemblies and, after thorough crosschecking, selected the best ones to constitute MCOT 1.0. MCOT 1.0 has 18,089 genes encoding 31,666 proteins: 32.8% match OGS 2.0 models perfectly or near perfectly, 11,747 differ considerably, and 29.5% are absent in OGS 2.0. Future automation of this process is anticipated to greatly reduce the human effort needed to generate comprehensive, reliable models of structural genes in other genome projects where extensive RNA-Seq data are available.