Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health GmbH, Neuherberg, Germany.
Technische Universität München, Center for Mathematics, Chair of Mathematical Modeling of Biological Systems, Garching, Germany.
Allergy. 2019 Jul;74(7):1364-1373. doi: 10.1111/all.13745. Epub 2019 Mar 31.
Associations between childhood asthma phenotypes and genetic, immunological, and environmental factors have been previously established. Yet strategies to integrate high-dimensional risk factors from multiple distinct data sets, and thereby increase the statistical power of analyses, have been hampered by a preponderance of missing data and a lack of methods to accommodate them.
We assembled questionnaire, diagnostic, genotype, microarray, RT-qPCR, flow cytometry, and cytokine data (referred to as data modalities) to use as input factors for a classifier that could distinguish healthy children, mild-to-moderate allergic asthmatics, and nonallergic asthmatics. Based on data from 260 German children aged 4-14 years from our university outpatient clinic, we built a novel multilevel prediction approach for asthma outcome that could handle the complex missing-data structure present in these data.
The optimal learning method was boosting based on all data sets, achieving an area under the receiver operating characteristic curve (AUC) for three classes of phenotypes of 0.81 (95% confidence interval (CI): 0.65-0.94) using leave-one-out cross-validation. Besides improving the AUC, our integrative multilevel learning approach led to tighter CIs than using smaller complete predictor data sets (AUC = 0.82 [0.66-0.94] for boosting). The most important variables for classifying childhood asthma phenotypes comprised newly identified genes, namely PKN2 (protein kinase N2), PTK2 (protein tyrosine kinase 2), and ALPP (alkaline phosphatase, placental).
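The evaluation scheme named above (leave-one-out cross-validation scored by a three-class AUC) can be illustrated with a minimal sketch. It is not the study's pipeline: the nearest-centroid scorer is a simple stand-in for boosting, the multiclass AUC is assumed to be a macro average of one-vs-rest AUCs (the abstract does not state which definition was used), and the data are synthetic. All function and variable names are hypothetical.

```python
import random


def auc_binary(scores, labels):
    """Mann-Whitney estimate of the AUC: P(positive score > negative score), ties count 0.5."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(score_rows, y, classes):
    """Assumed multiclass AUC: unweighted mean of the one-vs-rest AUCs."""
    per_class = [
        auc_binary([row[c] for row in score_rows], [label == c for label in y])
        for c in classes
    ]
    return sum(per_class) / len(per_class)


def centroid_scores(train_X, train_y, x, classes):
    """Stand-in classifier (NOT boosting): negative squared distance to each class centroid."""
    scores = {}
    for c in classes:
        rows = [r for r, label in zip(train_X, train_y) if label == c]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        scores[c] = -sum((a - b) ** 2 for a, b in zip(x, centroid))
    return scores


def loocv_scores(X, y, classes):
    """Leave-one-out CV: score each sample with a model fitted on all other samples."""
    held_out = []
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        held_out.append(centroid_scores(train_X, train_y, X[i], classes))
    return held_out


# Synthetic two-feature data, 20 samples per class (illustrative only).
rng = random.Random(0)
classes = ["healthy", "allergic_asthma", "nonallergic_asthma"]
centers = {"healthy": (0.0, 0.0), "allergic_asthma": (3.0, 0.0), "nonallergic_asthma": (0.0, 3.0)}
X, y = [], []
for c in classes:
    for _ in range(20):
        X.append([rng.gauss(centers[c][0], 1.0), rng.gauss(centers[c][1], 1.0)])
        y.append(c)

scores = loocv_scores(X, y, classes)
auc = macro_ovr_auc(scores, y, classes)
print(f"LOOCV macro one-vs-rest AUC: {auc:.3f}")
```

Because every sample is scored by a model that never saw it, the pooled held-out scores give a single AUC estimate for the whole cohort, which is practical when the sample is small.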
Our combination of several data modalities using a novel strategy improved classification of childhood asthma phenotypes but requires validation in external populations. The generic approach is applicable to other multilevel data-based risk prediction settings, which typically suffer from incomplete data.