van der Laan Mark J, Dudoit Sandrine, Keles Sunduz
Division of Biostatistics, School of Public Health, University of California, Berkeley, USA.
Stat Appl Genet Mol Biol. 2004;3:Article4. doi: 10.2202/1544-6115.1036. Epub 2004 Mar 22.
Likelihood-based cross-validation is a statistical tool for selecting a density estimate based on n i.i.d. observations from the true density among a collection of candidate density estimators. General examples are the selection of a model indexing a maximum likelihood estimator, and the selection of a bandwidth indexing a nonparametric (e.g. kernel) density estimator. In this article, we establish a finite sample result for a general class of likelihood-based cross-validation procedures (as indexed by the type of sample splitting used, e.g. V-fold cross-validation). This result implies that the cross-validation selector performs asymptotically as well (with respect to the Kullback-Leibler distance to the true density) as a benchmark model selector which is optimal for each given dataset and depends on the true density. Crucial conditions of our theorem are that the size of the validation sample converges to infinity, which excludes leave-one-out cross-validation, and that the candidate density estimates are bounded away from zero and infinity. We illustrate these asymptotic results and the practical performance of likelihood-based cross-validation for the purpose of bandwidth selection with a simulation study. Moreover, we use likelihood-based cross-validation in the context of regulatory motif detection in DNA sequences.
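As an illustration of the bandwidth-selection use case mentioned in the abstract, the following is a minimal sketch of V-fold likelihood-based cross-validation for a one-dimensional kernel density estimator. It is not the authors' implementation. The Gaussian kernel, the random fold assignment, the function names (`gaussian_kde_density`, `cv_log_likelihood`, `select_bandwidth`), and the simulated data are all assumptions made for this example. The `floor` parameter is a crude stand-in for the condition that candidate density estimates are bounded away from zero; it only prevents `log(0)` and does not renormalize the estimate.

```python
import numpy as np


def gaussian_kde_density(train, points, bandwidth):
    """Evaluate a Gaussian kernel density estimate fitted on `train` at `points`."""
    z = (points[:, None] - train[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return kernel.mean(axis=1) / bandwidth


def cv_log_likelihood(data, bandwidth, n_folds=5, seed=0, floor=1e-12):
    """Average validation log-likelihood of the estimator over V folds."""
    rng = np.random.default_rng(seed)
    # Assumption: folds are formed by a random permutation of the indices.
    folds = np.array_split(rng.permutation(len(data)), n_folds)
    total = 0.0
    for validation_idx in folds:
        train_mask = np.ones(len(data), dtype=bool)
        train_mask[validation_idx] = False
        # Fit on the training split, evaluate on the held-out validation split.
        density = gaussian_kde_density(
            data[train_mask], data[validation_idx], bandwidth
        )
        # Assumption: floor the density so the logarithm stays finite.
        total += np.sum(np.log(np.maximum(density, floor)))
    return total / len(data)


def select_bandwidth(data, candidates, n_folds=5, seed=0):
    """Return the candidate bandwidth with the highest cross-validated log-likelihood."""
    scores = [cv_log_likelihood(data, h, n_folds, seed) for h in candidates]
    return candidates[int(np.argmax(scores))], scores


# Usage on simulated data (hypothetical example, standard normal sample).
rng = np.random.default_rng(1)
sample = rng.normal(size=300)
grid = [0.05, 0.1, 0.2, 0.4, 0.8, 1.6]
best_h, cv_scores = select_bandwidth(sample, grid)
print("selected bandwidth:", best_h)
```

Maximizing the validation log-likelihood is equivalent to minimizing an empirical estimate of the Kullback-Leibler distance to the true density, up to a term that does not depend on the candidate. With `n_folds` fixed, the validation sample grows with n, in line with the condition stated in the abstract; setting `n_folds` equal to the sample size would give leave-one-out cross-validation, which the stated result does not cover.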