Castelo Robert, Guigó Roderic
Grup de Recerca en Informàtica Biomèdica, Institut Municipal d'Investigació Mèdica, Universitat Pompeu Fabra, Centre de Regulació Genòmica, Psg. Marítim 37-49, Barcelona, Spain.
Bioinformatics. 2004 Aug 4;20 Suppl 1:i69-76. doi: 10.1093/bioinformatics/bth932.
Computational identification of functional sites in nucleotide sequences is at the core of many algorithms for the analysis of genomic data. This identification is based on statistical parameters estimated from a training set. Often, because of the huge number of parameters, it is difficult to obtain consistent estimators. To simplify the estimation problem, one imposes independence assumptions between the nucleotides along the site. However, this can potentially limit the minimum value of the estimation error.
In this paper, we introduce a novel method for identifying functional sites. It finds a reasonable set of independence assumptions among the nucleotides that is supported by the data, and uses this set to identify sites by their likelihood ratio. More importantly, in many practical situations the method is capable of improving its performance as the training sample size increases. We apply the method to the identification of splice sites, and further evaluate its effect within the context of exon and gene prediction.
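To make the likelihood-ratio idea concrete, the following is a minimal sketch of scoring a candidate site under the simplest possible assumption, namely that every position is independent of every other. This is the fully independent baseline that the abstract contrasts with, not the method introduced in the paper, which instead selects a set of independence assumptions supported by the data. The function names, the pseudocount, and the toy sequences are all hypothetical and chosen only for illustration.

```python
import math

ALPHABET = "ACGT"


def estimate_position_model(sequences, pseudocount=1.0):
    """Estimate per-position nucleotide probabilities.

    Assumes every position is independent of all others (illustrative
    baseline only). The pseudocount avoids zero probabilities.
    """
    length = len(sequences[0])
    model = []
    for pos in range(length):
        counts = {nt: pseudocount for nt in ALPHABET}
        for seq in sequences:
            counts[seq[pos]] += 1
        total = sum(counts.values())
        model.append({nt: counts[nt] / total for nt in ALPHABET})
    return model


def log_likelihood_ratio(seq, site_model, background_model):
    """Sum of per-position log ratios: log P(seq | site) - log P(seq | background)."""
    return sum(
        math.log(site_model[i][nt] / background_model[i][nt])
        for i, nt in enumerate(seq)
    )


# Toy, made-up training data: hypothetical true sites and decoys.
true_sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]
decoys = ["GTCCTA", "GTTTCC", "GTACCT", "GTCTAG"]

site_model = estimate_position_model(true_sites)
background_model = estimate_position_model(decoys)

score_true = log_likelihood_ratio("GTAAGT", site_model, background_model)
score_decoy = log_likelihood_ratio("GTCCTA", site_model, background_model)
```

A positive score favours the site model and a negative score favours the background. Relaxing the independence assumption would mean conditioning some positions on others, at the cost of more parameters to estimate from the same training set.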