NOAA Fisheries, Northwest Fisheries Science Center, 2725 Montlake Blvd. East, Seattle, WA 98112, USA.
Mol Ecol. 2010 Jul;19(13):2599-601. doi: 10.1111/j.1365-294X.2010.04675.x.
Recognition of the importance of cross-validation ('any technique or instance of assessing how the results of a statistical analysis will generalize to an independent dataset'; Wiktionary, en.wiktionary.org) is one reason that the U.S. Securities and Exchange Commission requires all investment products to carry some variation of the disclaimer, 'Past performance is no guarantee of future results.' Even a cursory examination of financial behaviour, however, demonstrates that this warning is regularly ignored, even by those who understand what an independent dataset is. In the natural sciences, an analogue to predicting future returns of an investment strategy is predicting how well a particular algorithm will perform with new data. Once again, the key to an unbiased assessment of future performance is testing with independent data--that is, data that were in no way involved in developing the method in the first place.

A 'gold-standard' approach to cross-validation is to divide the data into two parts, one used to develop the algorithm, the other used to test its performance. Because this approach substantially reduces the sample size that can be used in constructing the algorithm, researchers often try other variations of cross-validation to accomplish the same ends. As illustrated by Anderson in this issue of Molecular Ecology Resources, however, not all attempts at cross-validation produce the desired result. Anderson used simulated data to evaluate the performance of several software programs designed to identify subsets of loci that can be effective for assigning individuals to their population of origin based on multilocus genetic data. Such programs are likely to become increasingly popular as researchers seek ways to streamline routine analyses by focusing on small sets of loci that contain most of the desired signal.
Anderson found that although some of the programs made an attempt at cross-validation, all failed to meet the 'gold standard' of using truly independent data and therefore produced overly optimistic assessments of the power of the selected set of loci--a phenomenon known as 'high-grading bias.'
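The mechanism behind high-grading bias can be made concrete with a small simulation (a minimal sketch, not the procedure used by Anderson or by any of the programs evaluated; the nearest-centroid classifier, the correlation-based locus ranking, and all sample sizes here are illustrative assumptions). With loci that carry no true population signal, choosing the 'best' loci using the full dataset and then cross-validating yields apparent assignment accuracy well above chance, whereas repeating the locus selection inside each training fold--so the test fold is truly independent--returns accuracy near the 50% expected under the null:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 200, 1000, 10
X = rng.normal(size=(n, p))      # simulated genotypic scores: no true signal
y = rng.integers(0, 2, size=n)   # random 'population of origin' labels

def select_top_loci(X, y, k):
    # rank loci by absolute correlation with the population labels
    yc = y - y.mean()
    corr = np.abs(yc @ (X - X.mean(axis=0))) / X.shape[0]
    return np.argsort(corr)[-k:]

def centroid_accuracy(Xtr, ytr, Xte, yte):
    # assign each test individual to the nearer population centroid
    c0 = Xtr[ytr == 0].mean(axis=0)
    c1 = Xtr[ytr == 1].mean(axis=0)
    pred = (np.linalg.norm(Xte - c1, axis=1) <
            np.linalg.norm(Xte - c0, axis=1)).astype(int)
    return (pred == yte).mean()

folds = np.array_split(rng.permutation(n), 5)
biased, proper = [], []
for te in folds:
    tr = np.setdiff1d(np.arange(n), te)
    # high-graded: loci chosen using ALL individuals, test fold included
    loci = select_top_loci(X, y, k)
    biased.append(centroid_accuracy(X[np.ix_(tr, loci)], y[tr],
                                    X[np.ix_(te, loci)], y[te]))
    # independent: loci chosen from the training fold only
    loci = select_top_loci(X[tr], y[tr], k)
    proper.append(centroid_accuracy(X[np.ix_(tr, loci)], y[tr],
                                    X[np.ix_(te, loci)], y[te]))

print(f"locus selection on all data:      {np.mean(biased):.2f}")  # optimistic
print(f"locus selection inside each fold: {np.mean(proper):.2f}")  # near chance
```

The gap between the two estimates is the bias itself: the 'biased' loci look informative only because the test individuals helped choose them, exactly the failure of independence the commentary describes.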