Biostatistics/Bioinformatics Shared Resources, Moores Cancer Center, University of California San Diego, La Jolla, CA 92093-0901, USA.
Stat Med. 2012 Dec 30;31(30):4175-89. doi: 10.1002/sim.5455. Epub 2012 Jul 5.
In this paper, we describe the implementation and evaluation of a cluster-based enrichment strategy for calling hits from a high-throughput screen, using a typical cell-based assay of 160,000 chemical compounds. Our focus is on the statistical properties of the prospective design choices made throughout the analysis: how to choose the number of clusters for optimal power, which test statistic to use, where to set the significance thresholds for clusters and the activity threshold for candidate hits, how to rank selected hits for carry-forward to the confirmation screen, and how to identify confirmed hits in a data-driven manner. Whereas the literature has previously focused on the choice of test statistic or chemical descriptors, our studies suggest that cluster size is the more important design choice. We recommend ranking clusters by enrichment odds ratio rather than by p-value. Our conceptually simple test statistic was found to identify the same set of hits as more complex scoring methods proposed in the literature. We prospectively confirm that such a cluster-based approach can outperform the naive top X approach, and we estimate that we improved confirmation rates by about 31.5%, from 813 using the top X approach to 1187 using our cluster-based method.
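The recommendation to rank clusters by enrichment odds ratio rather than by p-value can be illustrated with a minimal sketch. The abstract does not specify the test statistic, so the following rests on assumptions: a one-sided hypergeometric (Fisher-type) test for enrichment and a Haldane-Anscombe correction (adding 0.5 to each cell) to keep the odds ratio finite. The function name `cluster_enrichment`, the cluster labels, and all counts are hypothetical toy values, not data from the study.

```python
from math import comb


def cluster_enrichment(k_active, n_cluster, total_active, total_compounds):
    """Return (odds_ratio, p_value) for enrichment of actives in one cluster.

    Assumed approach, not necessarily the one used in the paper.
    """
    a = k_active                               # active, inside the cluster
    b = n_cluster - k_active                   # inactive, inside the cluster
    c = total_active - k_active                # active, outside the cluster
    d = total_compounds - n_cluster - c        # inactive, outside the cluster

    # Haldane-Anscombe correction: add 0.5 to every cell to avoid division by zero.
    odds_ratio = ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))

    # One-sided hypergeometric tail: P(X >= k_active) under random assignment.
    upper = min(n_cluster, total_active)
    tail = sum(
        comb(total_active, x) * comb(total_compounds - total_active, n_cluster - x)
        for x in range(k_active, upper + 1)
    )
    p_value = tail / comb(total_compounds, n_cluster)
    return odds_ratio, p_value


# Toy screen: 1000 compounds, 50 of them active (hypothetical numbers).
TOTAL_COMPOUNDS, TOTAL_ACTIVE = 1000, 50
clusters = {"A": (8, 20), "B": (3, 5), "C": (12, 100)}  # label: (actives, size)

results = {
    name: cluster_enrichment(k, n, TOTAL_ACTIVE, TOTAL_COMPOUNDS)
    for name, (k, n) in clusters.items()
}

# Rank clusters by odds ratio (largest first) and, for comparison, by p-value.
ranked_by_or = sorted(results, key=lambda name: results[name][0], reverse=True)
ranked_by_p = sorted(results, key=lambda name: results[name][1])

for name in ranked_by_or:
    odds_ratio, p_value = results[name]
    print(f"cluster {name}: OR = {odds_ratio:.2f}, p = {p_value:.2e}")
```

In this toy example the small cluster B has the largest odds ratio, while the larger cluster A has the smaller p-value, so the two rankings need not agree. This is one way to see why the choice of ranking criterion interacts with cluster size.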