Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA.
Harvard Medical School, Boston, Massachusetts, USA.
J Am Med Inform Assoc. 2020 Aug 1;27(8):1235-1243. doi: 10.1093/jamia/ocaa079.
A major bottleneck hindering utilization of electronic health record data for translational research is the lack of precise phenotype labels. Chart review as well as rule-based and supervised phenotyping approaches require laborious expert input, hampering applicability to studies that require many phenotypes to be defined and labeled de novo. Though International Classification of Diseases codes are often used as surrogates for true labels in this setting, these sometimes suffer from poor specificity. We propose a fully automated topic modeling algorithm to simultaneously annotate multiple phenotypes.
Surrogate-guided ensemble latent Dirichlet allocation (sureLDA) is a label-free multidimensional phenotyping method. It first uses the PheNorm algorithm to initialize probabilities based on 2 surrogate features for each target phenotype, and then leverages these probabilities to constrain the LDA topic model to generate phenotype-specific topics. Finally, it combines phenotype-feature counts with surrogates via clustering ensemble to yield final phenotype probabilities.
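The constraint step described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the initial phenotype probabilities are already available (for example, from PheNorm) and simply uses them as patient-specific Dirichlet weights in a collapsed Gibbs sampler for LDA, so that a phenotype with zero prior probability for a patient is never assigned to that patient's features. The function name `guided_lda` and the hyperparameters `alpha_scale` and `beta` are hypothetical, and the final clustering-ensemble step is not shown.

```python
import numpy as np


def guided_lda(counts, prior, n_iter=30, alpha_scale=1.0, beta=0.1, seed=0):
    """Collapsed Gibbs sampler for LDA with patient-specific topic priors.

    counts: (n_patients, n_features) non-negative integer feature counts.
    prior:  (n_patients, n_topics) initial phenotype probabilities,
            assumed to be precomputed from surrogate features.
    Returns (patient-topic counts, topic-feature counts).
    """
    counts = np.asarray(counts, dtype=int)
    # Assumption: the Dirichlet weight of topic k for patient d is
    # proportional to that patient's prior probability of phenotype k.
    alpha = alpha_scale * np.asarray(prior, dtype=float)
    if np.any(alpha.sum(axis=1) <= 0):
        raise ValueError("each patient needs at least one topic with nonzero prior")

    rng = np.random.default_rng(seed)
    n_docs, n_feats = counts.shape
    n_topics = alpha.shape[1]

    # Expand the count matrix into one (patient, feature) pair per occurrence.
    d_idx, w_idx = np.nonzero(counts)
    reps = counts[d_idx, w_idx]
    doc_of, word_of = np.repeat(d_idx, reps), np.repeat(w_idx, reps)

    ndk = np.zeros((n_docs, n_topics), dtype=int)   # patient-topic counts
    nkw = np.zeros((n_topics, n_feats), dtype=int)  # topic-feature counts
    nk = np.zeros(n_topics, dtype=int)              # total count per topic
    z = np.empty(doc_of.size, dtype=int)            # topic of each occurrence

    # Initialize each assignment by sampling from the patient's prior.
    for i, (d, w) in enumerate(zip(doc_of, word_of)):
        k = rng.choice(n_topics, p=alpha[d] / alpha[d].sum())
        z[i] = k
        ndk[d, k] += 1
        nkw[k, w] += 1
        nk[k] += 1

    for _ in range(n_iter):
        for i, (d, w) in enumerate(zip(doc_of, word_of)):
            k = z[i]
            # Remove the current assignment before resampling it.
            ndk[d, k] -= 1
            nkw[k, w] -= 1
            nk[k] -= 1
            # Standard collapsed-Gibbs conditional with the guided prior.
            p = (ndk[d] + alpha[d]) * (nkw[:, w] + beta) / (nk + n_feats * beta)
            k = rng.choice(n_topics, p=p / p.sum())
            z[i] = k
            ndk[d, k] += 1
            nkw[k, w] += 1
            nk[k] += 1
    return ndk, nkw


# Toy usage: 3 patients, 3 features, 2 phenotypes (made-up numbers).
counts = np.array([[5, 0, 1], [0, 4, 2], [3, 3, 0]])
prior = np.array([[0.9, 0.0], [0.1, 0.8], [0.5, 0.5]])
ndk, nkw = guided_lda(counts, prior)
print(ndk / ndk.sum(axis=1, keepdims=True))  # per-patient topic proportions
```

In this sketch the resulting patient-topic counts play the role of the phenotype-feature counts that would then be combined with the surrogates.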
sureLDA achieves reliably high accuracy and precision across a range of simulated and real-world phenotypes. Its performance is robust to phenotype prevalence and to the relative informativeness of surrogate vs nonsurrogate features. It also exhibits powerful feature selection properties.
sureLDA combines attractive properties of PheNorm and LDA to achieve high accuracy and precision robust to diverse phenotype characteristics. It offers particular improvement for phenotypes insufficiently captured by a few surrogate features. Moreover, sureLDA's feature selection ability enables it to handle high feature dimensions and produce interpretable computational phenotypes.
sureLDA is well suited to large-scale electronic health record phenotyping for highly multiphenotype applications such as phenome-wide association studies.