Centre of Biological Engineering, University of Minho, 4710-057, Braga, Portugal.
Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, 2780-157, Oeiras, Portugal.
BMC Bioinformatics. 2019 Sep 5;20(1):454. doi: 10.1186/s12859-019-3038-4.
As genome sequencing projects multiply, the diversity of organisms with recently assembled genome sequences has reached an unprecedented scale, highlighting the need for fast and efficient gene functional annotation. However, the quality of such annotations must be guaranteed, as they are the first indicator of the genomic potential of every organism. Automatic procedures help accelerate the annotation process, but they decrease the confidence and reliability of the outcomes. Manually curating a genome-wide annotation of gene, enzyme and transporter protein functions is a highly time-consuming, tedious and impractical task, even for the most proficient curator. Hence, a semi-automated procedure that balances the two approaches will increase the reliability of the annotation while speeding up the process. In fact, a prior analysis of the annotation algorithm may improve its performance by tuning its parameters, hastening the downstream processing and the manual curation involved in assigning functions to protein-encoding genes.
Here, SamPler, a novel strategy to select parameters for gene functional annotation routines, is presented. This semi-automated method is based on the manual curation of a randomly selected sample of genes/proteins. This sample is then used to assess, in a multi-dimensional array, the automatic annotations obtained for all possible combinations of the algorithm's parameters. These assessments yield an array of confusion matrices, for which several metrics (accuracy, precision and negative predictive value) are calculated and used to reach optimal values for the parameters.
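The parameter sweep described above can be illustrated with a minimal sketch. It is not SamPler's implementation: the two parameters (`weight`, `threshold`), the two per-gene scores (`score_a`, `score_b`) and the rule that an annotation is accepted when the weighted score reaches the threshold are all assumptions made for illustration. Each curated gene is reduced to a flag stating whether the curator confirmed the candidate annotation.

```python
from itertools import product


def confusion_matrix(sample, weight, threshold):
    """Count (TP, FP, TN, FN) for one parameter combination.

    `sample` holds (score_a, score_b, is_correct) tuples, where `is_correct`
    is the curator's verdict on the candidate annotation. The scoring rule
    below is hypothetical, not the rule used by any specific tool.
    """
    tp = fp = tn = fn = 0
    for score_a, score_b, is_correct in sample:
        score = weight * score_a + (1 - weight) * score_b
        accepted = score >= threshold
        if accepted and is_correct:
            tp += 1
        elif accepted:
            fp += 1
        elif is_correct:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Accuracy, precision and negative predictive value (None if undefined)."""
    def ratio(num, den):
        return num / den if den else None

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def evaluate_grid(sample, weights, thresholds):
    """Build one confusion matrix per parameter combination and score it."""
    return {
        (w, t): metrics(*confusion_matrix(sample, w, t))
        for w, t in product(weights, thresholds)
    }


def best_parameters(grid, metric="accuracy"):
    """Return the combination with the highest value of `metric`."""
    return max(grid, key=lambda k: -1.0 if grid[k][metric] is None else grid[k][metric])


if __name__ == "__main__":
    # Toy curated sample; the values are invented for the example.
    curated = [(0.9, 0.9, True), (0.8, 0.8, True), (0.3, 0.3, False), (0.6, 0.6, False)]
    grid = evaluate_grid(curated, weights=[0.25, 0.5, 0.75], thresholds=[0.5, 0.7])
    print(best_parameters(grid))
```

Selecting on accuracy alone is only one possible choice; the abstract lists precision and negative predictive value as further metrics, and a real selection rule may weigh them differently.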
The potential of this methodology is demonstrated with four genome functional annotations performed in merlin, an in-house, user-friendly computational framework for genome-scale metabolic annotation and model reconstruction. For that purpose, SamPler was implemented as a new plugin for merlin.