1 Centre for Health Informatics, Institute of Population Health, and.
Am J Respir Crit Care Med. 2013 Dec 1;188(11):1303-12. doi: 10.1164/rccm.201304-0694OC.
Unsupervised statistical learning techniques, such as exploratory factor analysis (EFA) and hierarchical clustering (HC), have been used to identify asthma phenotypes, with only partly consistent results. Some of the inconsistency stems from variable selection and from demographic and clinical differences among study populations.
To investigate the effects of the choice of statistical method and different preparations of data on the clustering results; and to relate these to disease severity.
Several variants of EFA and HC were applied and compared using various sets of variables and different encodings and transformations within a dataset of 383 children with asthma. Variables included lung function, inflammatory and allergy markers, family history, environmental exposures, and medications. Clusters and original variables were related to asthma severity (logistic regression and Bayesian network analysis).
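The comparison described above can be sketched as follows. This is a minimal illustration on synthetic data, not the study's pipeline: the two variables, the choice of three clusters, the log transform, and the helper names `hc_labels` and `rand_index` are all assumptions made for the example. It builds hierarchical clusterings under different linkage functions and under raw versus log-transformed encodings of a skewed marker, then measures how often two partitions agree on pairs of subjects.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def zscore(X):
    """Standardise each column to mean 0 and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def hc_labels(X, method, k=3):
    """Agglomerative clustering on Euclidean distances, cut into at most k clusters."""
    Z = linkage(X, method=method, metric="euclidean")
    return fcluster(Z, t=k, criterion="maxclust")


def rand_index(a, b):
    """Share of subject pairs on which two partitions agree (same/different cluster)."""
    a, b = np.asarray(a), np.asarray(b)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), k=1)  # each unordered pair once
    return float(np.mean(same_a[iu] == same_b[iu]))


# Synthetic stand-in data (hypothetical): one roughly normal variable and
# one right-skewed marker, for 383 subjects.
rng = np.random.default_rng(0)
n = 383
lung_function = rng.normal(loc=95.0, scale=12.0, size=n)
skewed_marker = rng.lognormal(mean=4.0, sigma=1.2, size=n)

raw = zscore(np.column_stack([lung_function, skewed_marker]))
logged = zscore(np.column_stack([lung_function, np.log(skewed_marker)]))

# Same data, different linkage functions.
labels_ward = hc_labels(raw, "ward")
labels_avg = hc_labels(raw, "average")
agreement_linkage = rand_index(labels_ward, labels_avg)

# Same linkage, different transformation of the skewed marker.
labels_ward_log = hc_labels(logged, "ward")
agreement_transform = rand_index(labels_ward, labels_ward_log)

print(f"agreement across linkages:        {agreement_linkage:.2f}")
print(f"agreement across transformations: {agreement_transform:.2f}")
```

The plain Rand index is used here only because it is short to write; a chance-corrected measure such as the adjusted Rand index would be a reasonable substitute when cluster sizes are very unequal.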
EFA identified five components (eigenvalues ≥ 1) explaining 35% of the overall variance. Varying the HC linkage and distance functions did not affect the cluster inference; however, using different variable encodings and transformations did. The derived clusters predicted asthma severity less well than the original variables. Prognostic factors of severity were medication usage, current symptoms, lung function, paternal asthma, body mass index, and age of asthma onset. Bayesian networks indicated conditional dependence among variables.
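The "eigenvalues ≥ 1" retention rule can be illustrated with a short sketch. It assumes the rule is applied to the eigenvalues of the variable correlation matrix, which is the usual form of this criterion; it is not a full factor analysis (no rotation or loading estimation), and the synthetic data and the function name `retained_components` are hypothetical.

```python
import numpy as np


def retained_components(X):
    """Apply the eigenvalue >= 1 rule to the correlation matrix of X.

    Returns the eigenvalues in descending order, the number retained,
    and the fraction of total variance the retained ones explain.
    """
    corr = np.corrcoef(X, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)[::-1]  # eigvalsh returns ascending order
    keep = eigvals >= 1.0
    explained = eigvals[keep].sum() / eigvals.sum()
    return eigvals, int(keep.sum()), float(explained)


# Synthetic stand-in data (hypothetical): 383 subjects, 12 variables
# driven by two latent factors plus substantial noise.
rng = np.random.default_rng(0)
n, p = 383, 12
latent = rng.normal(size=(n, 2))
loadings = rng.normal(size=(2, p))
X = latent @ loadings + rng.normal(scale=2.0, size=(n, p))

eigvals, n_keep, explained = retained_components(X)
print(f"retained {n_keep} of {p} components, "
      f"explaining {explained:.0%} of the variance")
```

Because the eigenvalues of a correlation matrix sum to the number of variables, the explained fraction is simply the sum of the retained eigenvalues divided by that number. A low value, such as the 35% reported above, means most of the variance lies outside the retained components.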
The use of different unsupervised statistical learning methods and different variable sets and encodings can lead to multiple and inconsistent subgroupings of asthma, not necessarily correlated with severity. The search for asthma phenotypes needs more careful selection of markers, consistent across different study populations, and more cautious interpretation of results from unsupervised learning.