Department of Statistics, Penn State University, 301 Thomas Building, State College, PA 16801, USA.
Biostatistics. 2012 Jul;13(3):509-22. doi: 10.1093/biostatistics/kxr033. Epub 2011 Oct 31.
In high-throughput cancer genomic studies, markers identified from the analysis of single data sets often suffer from a lack of reproducibility because of small sample sizes. The ideal solution is to conduct large-scale prospective studies, but these are extremely expensive and time-consuming. A cost-effective remedy is to pool data from multiple comparable studies and conduct integrative analysis. Integrative analysis of multiple data sets is challenging because of the high dimensionality of genomic measurements and the heterogeneity among studies. In this article, we propose a sparse boosting approach for marker identification in integrative analysis of multiple heterogeneous cancer diagnosis studies with gene expression measurements. The proposed approach can effectively accommodate the heterogeneity among multiple studies and identify markers with consistent effects across studies. Simulation shows that the proposed approach has satisfactory identification results and outperforms alternatives including an intensity approach and meta-analysis. The proposed approach is used to identify markers of pancreatic cancer and liver cancer.