Huang Yuan, Liu Jin, Yi Huangdi, Shia Ben-Chang, Ma Shuangge
VA Cooperative Studies Program Coordinating Center, West Haven, CT; Department of Biostatistics, Yale University, New Haven, CT, U.S.A.
Center of Quantitative Medicine, Duke-NUS Medical School, Singapore.
Stat Med. 2017 Feb 10;36(3):509-559. doi: 10.1002/sim.7138. Epub 2016 Sep 25.
In profiling studies, the analysis of a single dataset often leads to unsatisfactory results because of the small sample size. Multi-dataset analysis utilizes information from multiple independent datasets and outperforms single-dataset analysis. Among the available multi-dataset analysis methods, integrative analysis methods aggregate and analyze raw data and outperform meta-analysis methods, which analyze multiple datasets separately and then pool summary statistics. In this study, we conduct integrative analysis and marker selection under the heterogeneity structure, which allows different datasets to have overlapping but not necessarily identical sets of markers. Under certain scenarios, it is reasonable to expect some similarity of identified marker sets - or equivalently, similarity of model sparsity structures - across multiple datasets. However, the existing methods do not have a mechanism to explicitly promote such similarity. To tackle this problem, we develop a sparse boosting method. This method uses a BIC/HDBIC criterion to select weak learners in boosting and encourages sparsity. A new penalty is introduced to promote the similarity of model sparsity structures across datasets. The proposed method has an intuitive formulation and is broadly applicable and computationally affordable. In numerical studies, we analyze right-censored survival data under the accelerated failure time model. Simulation shows that the proposed method outperforms alternative boosting and penalization methods with more accurate marker identification. The analysis of three breast cancer prognosis datasets shows that the proposed method can identify marker sets with increased similarity across datasets and improved prediction performance. Copyright © 2016 John Wiley & Sons, Ltd.
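To make the idea of selecting weak learners with a BIC-type criterion concrete, the following is a minimal sketch for a single, uncensored dataset. It is not the authors' algorithm: it omits the cross-dataset similarity penalty, the HDBIC variant, and any handling of censoring under the accelerated failure time model. The function name `sparse_boost`, the step size, and the use of the number of selected covariates as the degrees of freedom are all assumptions made for illustration.

```python
import numpy as np

def sparse_boost(X, y, step=0.1, max_iter=500):
    """Componentwise L2 boosting with a BIC-type selection rule (illustrative).

    Assumption: degrees of freedom are approximated by the number of
    distinct covariates selected so far, which is a simplification.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)              # center covariates
    resid = y - y.mean()                 # start from the intercept-only fit
    col_ss = (Xc ** 2).sum(axis=0)       # squared norm of each covariate
    beta = np.zeros(p)
    selected = np.zeros(p, dtype=bool)
    best = n * np.log(resid @ resid / n)  # criterion of the empty model

    for _ in range(max_iter):
        # Least-squares coefficient of the current residual on each covariate.
        coef = Xc.T @ resid / col_ss
        # Residual sum of squares after a shrunken update on each candidate.
        rss = resid @ resid - (2 * step - step ** 2) * coef ** 2 * col_ss
        # Adding a covariate that is not yet selected costs one extra df.
        df = selected.sum() + (~selected)
        crit = n * np.log(rss / n) + df * np.log(n)
        j = int(np.argmin(crit))
        if crit[j] >= best:              # stop when the criterion no longer improves
            break
        best = crit[j]
        beta[j] += step * coef[j]
        resid = resid - step * coef[j] * Xc[:, j]
        selected[j] = True
    return beta, selected
```

Because an update on an already selected covariate adds no penalty while a new covariate costs `log(n)`, the rule tends to keep refining the current marker set before admitting new markers, which is the sparsity-encouraging behavior the abstract describes.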