Wu Wei, Bleecker Eugene, Moore Wendy, Busse William W, Castro Mario, Chung Kian Fan, Calhoun William J, Erzurum Serpil, Gaston Benjamin, Israel Elliot, Curran-Everett Douglas, Wenzel Sally E
Lane Center for Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pa.
Center for Human Genomics, School of Medicine, Wake Forest University, Winston-Salem, NC.
J Allergy Clin Immunol. 2014 May;133(5):1280-8. doi: 10.1016/j.jaci.2013.11.042. Epub 2014 Feb 28.
Previous studies have identified asthma phenotypes based on small numbers of clinical, physiologic, or inflammatory characteristics. However, no studies have applied machine learning approaches to a wide range of variables.
We sought to identify subphenotypes of asthma by using blood, bronchoscopic, exhaled nitric oxide, and clinical data from the Severe Asthma Research Program with unsupervised clustering and then characterize them by using supervised learning approaches.
Unsupervised clustering approaches were applied to 112 clinical, physiologic, and inflammatory variables from 378 subjects. Variable selection and supervised learning techniques were used to select relevant and nonredundant variables and to assess their predictive value, as well as the predictive value of the full variable set.
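The two-stage design described here (cluster the subjects without labels, then test how well the resulting cluster assignments can be predicted for held-out subjects) can be sketched as follows. This is a minimal illustration, not the study's method: the abstract does not name the clustering or classification algorithms, so plain k-means and a nearest-centroid classifier are used as stand-ins, the farthest-first initialization is chosen only to keep the run deterministic, and the data are synthetic. All function names, group counts, and variable counts are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(points, k, max_iter=100):
    """Plain k-means (stand-in for the study's unsupervised step)."""
    # Deterministic farthest-first initialization (an assumption of this sketch).
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = None
    for _ in range(max_iter):
        # Assignment step: each subject goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = mean(members)
    return labels

def fit_centroids(points, labels):
    """Nearest-centroid classifier: one mean vector per cluster label."""
    return {c: mean([p for p, l in zip(points, labels) if l == c])
            for c in sorted(set(labels))}

def predict(point, class_centroids):
    """Predict the label whose centroid is closest to the point."""
    return min(class_centroids, key=lambda c: sq_dist(point, class_centroids[c]))

def holdout_accuracy(points, labels, every=4):
    """Hold out every `every`-th subject, train on the rest, score the held-out set."""
    pairs = list(zip(points, labels))
    train = [pl for i, pl in enumerate(pairs) if i % every != 0]
    test = [pl for i, pl in enumerate(pairs) if i % every == 0]
    model = fit_centroids([p for p, _ in train], [l for _, l in train])
    hits = sum(predict(p, model) == l for p, l in test)
    return hits / len(test)

def make_toy_subjects(n_per_group=30, n_vars=4, n_groups=3, seed=0):
    """Synthetic, well-separated 'subjects'; not real patient data."""
    rng = random.Random(seed)
    return [[10.0 * g + rng.uniform(-1, 1) for _ in range(n_vars)]
            for g in range(n_groups) for _ in range(n_per_group)]

subjects = make_toy_subjects()
cluster_labels = kmeans(subjects, k=3)
print("held-out accuracy:", holdout_accuracy(subjects, cluster_labels))
```

Because the toy groups are deliberately far apart, the held-out accuracy here is trivially high and says nothing about performance on real clinical variables, which are noisy, mixed in type, and often missing.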
Ten variable clusters and 6 subject clusters were identified, which both differed from and overlapped with previously reported clusters. Patients with traditionally defined severe asthma were distributed through subject clusters 3 to 6. Cluster 4 identified patients with early-onset allergic asthma with low lung function and eosinophilic inflammation. Patients with later-onset, mostly severe asthma with nasal polyps and eosinophilia characterized cluster 5. Cluster 6 asthmatic patients manifested persistent inflammation in blood and bronchoalveolar lavage fluid and exacerbations despite high systemic corticosteroid use and side effects. Age of asthma onset, quality of life, symptoms, medications, and health care use were some of the 51 nonredundant variables distinguishing subject clusters. These 51 variables classified test cases with 88% accuracy, compared with 93% accuracy when all 112 variables were used.
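One simple way to arrive at a nonredundant variable subset of the kind reported here is a greedy correlation filter: keep a variable only if it is not strongly correlated with any variable already kept. The sketch below is an assumption for illustration; the abstract does not say how the 51 variables were chosen, and the 0.9 threshold, the variable names, and the values are all hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def drop_redundant(columns, threshold=0.9):
    """Greedy filter: keep a variable only if |r| < threshold against all kept ones.

    `columns` maps a variable name to its list of per-subject values.
    Variables are visited in insertion order, so earlier ones win ties.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy variables for five subjects (not study data).
toy = {
    "fev1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "fev1_rescaled": [2.0, 4.0, 6.0, 8.0, 10.0],  # carries the same information as fev1
    "age_onset": [3.0, 1.0, 4.0, 1.0, 5.0],
}
print(drop_redundant(toy))
```

A filter like this addresses redundancy only; whether the retained variables are relevant for separating clusters would still need to be checked, for example by comparing held-out classification accuracy on the reduced and full variable sets as the abstract reports.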
The unsupervised machine learning approaches used here provide unique insights into disease, confirming other approaches while revealing novel additional phenotypes.