Chen Junxiang, Sun Li, Yu Ke, Batmanghelich Kayhan
Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania.
Proceedings (IEEE Int Conf Bioinformatics Biomed). 2021 Dec;2021:3464-3471. doi: 10.1109/bibm52615.2021.9669878.
Extracting hidden phenotypes is essential in medical data analysis because it facilitates disease subtyping, diagnosis, and understanding of disease etiology. Since the hidden phenotype is usually a low-dimensional representation that comprehensively describes the disease, we require a dimensionality-reduction method that captures as much disease-relevant information as possible. However, most unsupervised or self-supervised methods cannot achieve this goal because they learn a holistic representation containing both disease-relevant and disease-irrelevant information. Supervised methods capture only the information that is predictive of the target clinical variable, so the learned representation usually does not generalize to the various aspects of the disease. Hence, we develop a dimensionality-reduction approach based on information theory to extract Disease Relevant Features (DRFs). We propose to use clinical variables that weakly define the disease as so-called anchors. We derive a formulation that makes the DRFs predictive of the anchors while forcing the remaining representation to be irrelevant to the anchors via adversarial regularization. We apply our method to a large-scale study of Chronic Obstructive Pulmonary Disease (COPD). Our experiments show that: (1) The learned DRFs predict the anchors as well as the original representation does, despite having a significantly lower dimension. (2) Compared to a supervised representation, the learned DRFs are more predictive of other relevant disease metrics that are not used during training. (3) The learned DRFs are related to non-imaging biological measurements such as gene expression, suggesting that the DRFs include information related to the underlying biology of the disease.
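The split described above (one part of the representation predicts the anchors, the rest is adversarially pushed to carry no anchor information) can be illustrated with a minimal sketch. This is not the authors' formulation, which is derived from information theory: the linear heads, the squared-error losses, and the names `drf_objective`, `k`, `w_pred`, `w_adv`, and `lam` are all assumptions chosen for illustration.

```python
def mse(pred, target):
    """Mean squared error between two equal-length lists of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def linear(rows, weights):
    """Apply a single linear head (no bias) to every row."""
    return [sum(x * w for x, w in zip(row, weights)) for row in rows]


def drf_objective(z, anchor, k, w_pred, w_adv, lam=1.0):
    """Return (prediction loss, adversary loss, encoder objective).

    Assumption: the first k dimensions of each representation z[i] play the
    role of the DRFs and the remaining dimensions are the "rest".
    """
    drf = [row[:k] for row in z]
    rest = [row[k:] for row in z]
    # The DRF head should predict the anchor well (loss to be minimized).
    pred_loss = mse(linear(drf, w_pred), anchor)
    # The adversary tries to predict the anchor from the rest; it minimizes
    # adv_loss, while the encoder is rewarded when the adversary fails.
    adv_loss = mse(linear(rest, w_adv), anchor)
    encoder_objective = pred_loss - lam * adv_loss
    return pred_loss, adv_loss, encoder_objective


# Toy data (hypothetical): three samples, a 3-dimensional representation,
# one scalar anchor per sample.
z = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5], [1.0, 1.0, 0.0]]
anchor = [1.0, 2.0, 3.0]
pred_loss, adv_loss, enc_obj = drf_objective(
    z, anchor, k=2, w_pred=[1.0, 2.0], w_adv=[0.0]
)
```

In an actual training loop the adversary's weights would be updated to lower `adv_loss` and the encoder and predictor would be updated to lower `encoder_objective`, in alternation; the sketch only evaluates the two losses for fixed weights.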