Lijoi Antonio, Mena Ramsés H, Prünster Igor
Department of Economics and Quantitative Methods, University of Pavia, 27100 Pavia and Institute for Applied Mathematics and Information Technology, National Research Council, 20133 Milan, Italy.
BMC Bioinformatics. 2007 Sep 14;8:339. doi: 10.1186/1471-2105-8-339.
Expressed sequence tag (EST) analysis is a fundamental tool for gene identification in organisms. Given a preliminary EST sample from a certain library, several statistical prediction problems arise. In particular, it is of interest to estimate how many new genes can be detected in a future EST sample of given size and to determine the gene discovery rate: these estimates are the basis for deciding whether to proceed with sequencing the library and, if so, a guideline for selecting the size of the new sample. Such information is also useful for establishing sequencing efficiency in experimental design and for measuring the degree of redundancy of an EST library.
In this work we propose a Bayesian nonparametric approach to statistical problems arising in EST surveys. In particular, we provide estimates of: a) the coverage, defined as the proportion of unique genes in the library represented in the given sample of reads; b) the number of new unique genes to be observed in a future sample; c) the discovery rate of new genes as a function of the future sample size. The Bayesian nonparametric model we adopt incorporates the available information into prediction in a statistically rigorous way. Our proposal has advantages over frequentist nonparametric methods, which become unstable when prediction is required for large future samples. EST libraries previously studied with frequentist methods are analyzed in detail.
The Bayesian nonparametric approach we undertake yields valuable tools for gene capture and prediction in EST libraries. The estimators we obtain do not feature the kind of drawbacks associated with frequentist estimators and are reliable for any size of the additional sample.
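To make the three quantities (coverage, number of new genes, discovery rate) concrete, the following is a minimal sketch under a stated assumption: the abstract does not name the prior, so the sketch assumes a two-parameter Poisson-Dirichlet (Pitman-Yor) prior with parameters `sigma` in (0, 1) and `theta` > -`sigma`. Under that assumption the predictive quantities have closed forms that depend on the data only through the sample size `n` and the number of distinct genes observed `k`. The function names and the parameter values in the usage lines are hypothetical, and the sketch does not fit `sigma` and `theta` to data.

```python
import math


def discovery_probability(n, k, sigma, theta):
    """Probability that read n+1 reveals a new gene, given k distinct
    genes in n reads, under an ASSUMED Pitman-Yor(sigma, theta) prior."""
    return (theta + sigma * k) / (theta + n)


def coverage(n, k, sigma, theta):
    """Estimated sample coverage: one minus the discovery probability."""
    return 1.0 - discovery_probability(n, k, sigma, theta)


def expected_new_genes(n, k, m, sigma, theta):
    """Expected number of new distinct genes in m additional reads.

    Closed form under the assumed prior:
        (k + theta/sigma) * [ (theta+n+sigma)_m / (theta+n)_m - 1 ],
    where (x)_m is the rising factorial. The ratio is computed on the
    log scale with lgamma so that large m does not overflow.
    """
    a = theta + n
    log_ratio = (math.lgamma(a + sigma + m) - math.lgamma(a + sigma)
                 - math.lgamma(a + m) + math.lgamma(a))
    return (k + theta / sigma) * math.expm1(log_ratio)


def discovery_rate(n, k, m, sigma, theta):
    """Expected probability that read n+m+1 is a new gene, i.e. the
    discovery rate after m additional reads (m = 0 gives the current rate)."""
    k_future = k + expected_new_genes(n, k, m, sigma, theta)
    return (theta + sigma * k_future) / (theta + n + m)


# Usage with arbitrary illustrative values (not taken from any EST library):
n, k, sigma, theta = 1000, 400, 0.6, 50.0
print(coverage(n, k, sigma, theta))
print(expected_new_genes(n, k, 2000, sigma, theta))
print(discovery_rate(n, k, 2000, sigma, theta))
```

Because every quantity is a closed-form function of `n`, `k`, and the additional sample size `m`, the estimates are defined for any `m`, which is one way to read the abstract's claim of reliability for additional samples of any size.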