Department of Industrial Engineering, Tsinghua University, Beijing, China.
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
J Biomed Inform. 2019 Mar;91:103122. doi: 10.1016/j.jbi.2019.103122. Epub 2019 Feb 7.
Phenotyping algorithms can efficiently and accurately identify patients with a specific disease phenotype and construct electronic health record (EHR)-based cohorts for subsequent clinical or genomic studies. Previous studies have introduced unsupervised EHR-based feature selection methods that yield algorithms with high accuracy. However, those methods still require expert intervention to tune the parameter settings according to the EHR data distribution for each phenotype. To further accelerate the development of phenotyping algorithms, we propose a fully automated and robust unsupervised feature selection method that leverages only publicly available medical knowledge sources rather than EHR data.
SEmantics-Driven Feature Extraction (SEDFE) collects medical concepts from online knowledge sources as candidate features and assigns them vector-form distributional semantic representations derived with neural word embedding and the Unified Medical Language System Metathesaurus. The features that are semantically closest to the target phenotype and that sufficiently characterize it are determined by a linear decomposition criterion and selected for the final classification algorithm.
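The selection step can be illustrated with a minimal sketch. The abstract does not specify the linear decomposition criterion, so the stopping rule below is an assumption made for illustration only: candidates are ranked by cosine similarity to the target phenotype vector and added one at a time until a least-squares combination of the selected vectors reconstructs the target within a relative tolerance. The function names, the `tol` parameter, and the toy vectors are hypothetical and are not taken from the SEDFE implementation.

```python
import numpy as np


def rank_by_cosine(target, candidates):
    """Return candidate indices ordered from most to least similar to target."""
    t = target / np.linalg.norm(target)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = c @ t
    return np.argsort(-sims), sims


def select_features(target, candidates, tol=0.1):
    """Greedy sketch of a linear-decomposition stopping rule (an assumption).

    Candidates are added in order of cosine similarity until the target
    vector is approximated by a linear combination of the selected
    vectors with relative residual <= tol.
    """
    order, _ = rank_by_cosine(target, candidates)
    selected = []
    rel_residual = 1.0
    for idx in order:
        selected.append(int(idx))
        basis = candidates[selected].T  # shape: (dim, n_selected)
        coef, *_ = np.linalg.lstsq(basis, target, rcond=None)
        residual = np.linalg.norm(target - basis @ coef)
        rel_residual = residual / np.linalg.norm(target)
        if rel_residual <= tol:
            break
    return selected, rel_residual


# Toy 3-dimensional "semantic vectors" (hypothetical values).
target = np.array([1.0, 1.0, 0.0])
candidates = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.2, 1.0, 0.0],
])
selected, rel_residual = select_features(target, candidates)
```

In this toy case the third candidate is most similar to the target, but it cannot reconstruct the target alone, so the first candidate is added as well; the orthogonal second candidate is never needed.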
SEDFE was compared with the EHR-based SAFE algorithm and with domain experts on feature selection for the classification of five phenotypes (coronary artery disease, rheumatoid arthritis, Crohn's disease, ulcerative colitis, and pediatric pulmonary arterial hypertension), using both supervised and unsupervised approaches. Algorithms built on SEDFE-selected features achieved accuracy comparable to that of algorithms built on SAFE-selected and expert-curated features. SEDFE was also robust to variation in the input semantic vectors.
SEDFE attains satisfactory performance in unsupervised feature selection for EHR phenotyping. Because it is both fully automated and EHR-independent, the method promises efficiency and accuracy in developing algorithms for high-throughput phenotyping.