Landesfeind Manuel, Meinicke Peter
Department of Bioinformatics, Institute for Microbiology and Genetics, Georg-August-University, Goldschmidtstraße 1, 37077 Göttingen, Germany.
BMC Genomics. 2014 Nov 20;15(1):1003. doi: 10.1186/1471-2164-15-1003.
The annotation of biomolecular functions is an essential step in the analysis of newly sequenced organisms. Usually, the functions are inferred from genes predicted on the genome, using homology search techniques. A high-quality genomic sequence is an important prerequisite, but one that is difficult to obtain for certain organisms, such as hybrids or organisms with large genomes. A de novo transcriptome assembly can also be used for functional analysis, but its computational requirements can be demanding. So far, it has been unclear how much of the functional repertoire of an organism can be reliably predicted from unassembled RNA-seq short reads alone.
We conducted a study to investigate to what degree the functional profile of an organism can be reconstructed from unassembled transcriptome data. We simulated the de novo prediction of biomolecular functions for Arabidopsis thaliana using a comprehensive RNA-seq data set. We evaluated the prediction performance of several homology search methods in combination with different evidence measures. To decide on the presence or absence of a particular function under noisy conditions, we propose a statistical mixture model that enables unsupervised estimation of a detection threshold. Our results indicate that biomolecular functions from the KEGG database can be predicted with a sensitivity of up to 94 percent. In this setting, applying the mixture model for automatic threshold calibration reduced the falsely predicted functions down to 4 percent. Furthermore, we found that our statistical approach even outperforms prediction from a de novo transcriptome assembly.
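The idea of calibrating a detection threshold with a mixture model can be illustrated with a minimal sketch. The abstract does not state the form of the mixture or the evidence measure, so everything here is an assumption for illustration: a two-component Gaussian mixture fitted by EM to log-scaled evidence scores (one score per function), with the threshold placed where the posterior probability of the "present" component reaches 0.5. The function names `fit_two_component_mixture` and `estimate_threshold` are hypothetical, and the scores are simulated.

```python
import numpy as np


def fit_two_component_mixture(x, n_iter=500, tol=1e-9):
    """Fit a two-component 1-D Gaussian mixture with EM (illustrative assumption)."""
    x = np.asarray(x, dtype=float)
    # Initialise the two components from the lower and upper halves of the data.
    med = np.median(x)
    lo, hi = x[x <= med], x[x > med]
    mu = np.array([lo.mean(), hi.mean()])
    sigma = np.array([lo.std(), hi.std()]) + 1e-3
    weight = np.array([0.5, 0.5])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibility of each component for each score.
        z = (x[:, None] - mu) / sigma
        dens = weight * np.exp(-0.5 * z**2) / (sigma * np.sqrt(2 * np.pi))
        total = np.maximum(dens.sum(axis=1), 1e-300)
        resp = dens / total[:, None]
        # M-step: update weights, means and standard deviations.
        nk = resp.sum(axis=0)
        weight = nk / x.size
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
        # Stop when the log-likelihood no longer changes.
        ll = np.log(total).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return weight, mu, sigma


def estimate_threshold(scores):
    """Return the score at which the 'present' posterior first reaches 0.5."""
    weight, mu, sigma = fit_two_component_mixture(scores)
    # Order components so that index 0 is noise (low mean), index 1 is present.
    order = np.argsort(mu)
    weight, mu, sigma = weight[order], mu[order], sigma[order]
    # Search between the two component means for the posterior crossing.
    grid = np.linspace(mu[0], mu[1], 10001)
    dens = weight * np.exp(-0.5 * ((grid[:, None] - mu) / sigma) ** 2) / sigma
    post_present = dens[:, 1] / dens.sum(axis=1)
    return float(grid[np.argmax(post_present >= 0.5)])


# Simulated log10 evidence scores: 300 absent (noise) and 700 present functions.
rng = np.random.default_rng(0)
noise = rng.normal(1.0, 0.5, 300)
present = rng.normal(4.0, 0.7, 700)
threshold = estimate_threshold(np.concatenate([noise, present]))
print(f"estimated threshold (log10 scale): {threshold:.2f}")
```

No labels are used during fitting, which is what makes the threshold estimate unsupervised; the known labels of the simulated scores serve only to check the result afterwards. Real evidence scores may call for different component distributions than the Gaussians assumed here.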
The analysis of an organism's transcriptome can provide a solid basis for the prediction of biomolecular functions. Using RNA-seq short reads directly, the functional profile of an organism can be reconstructed in a computationally efficient way to provide a draft annotation in cases where the classical genome-based approaches cannot be applied.