Department of Biomedical Engineering, Vanderbilt University, Nashville, Tennessee, USA.
Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA.
J Am Med Inform Assoc. 2023 Feb 16;30(3):456-465. doi: 10.1093/jamia/ocac234.
A previous study, PheMAP, combined independent online resources to enable high-throughput phenotyping (HTP) using electronic health records (EHRs). However, these online resources describe diseases with varying quality, which may affect phenotyping performance. We aimed to evaluate the phenotyping performance of single resource-based PheMAPs and to investigate an optimized strategy for HTP.
We ranked concept unique identifiers (CUIs) from each resource by term frequency-inverse document frequency and used Jaccard similarity matrices to compare the top-ranked CUIs of the single resources and the original PheMAP. We correlated top-ranked concepts from each resource with features used in established Phenotype KnowledgeBase (PheKB) algorithms for hypothyroidism, type 2 diabetes mellitus (T2DM), and dementias. Using each resource separately, we calculated multiple phenotype risk scores for individuals from Vanderbilt University Medical Center's BioVU DNA Biobank and compared phenotyping performance against rule-based eMERGE algorithms. Lastly, we implemented an ensemble strategy that classified patient case/control status based on agreement among the PheMAP resources.
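The Jaccard comparison described above can be sketched as follows. This is a minimal illustration, not the study's code: the resource names and CUI values are hypothetical placeholders, and treating two empty sets as having similarity 0.0 is an assumed convention.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two collections of CUIs."""
    a, b = set(a), set(b)
    union = a | b
    # Assumed convention: two empty sets have similarity 0.0.
    return len(a & b) / len(union) if union else 0.0


def jaccard_matrix(top_cuis):
    """Pairwise Jaccard similarities, keyed by (resource, resource)."""
    names = sorted(top_cuis)
    return {(r, c): jaccard(top_cuis[r], top_cuis[c]) for r in names for c in names}


# Hypothetical top-ranked CUIs per resource (placeholder identifiers, not study data).
top_cuis = {
    "resource_a": ["C0000001", "C0000002", "C0000003"],
    "resource_b": ["C0000002", "C0000003", "C0000004"],
    "resource_c": ["C0000005"],
}

matrix = jaccard_matrix(top_cuis)
for (row, col), value in sorted(matrix.items()):
    print(f"{row} vs {col}: {value:.2f}")
```

Values near 1 indicate that two resources surface largely the same top-ranked concepts; values near 0 indicate that they describe the disease with mostly different concepts.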
Jaccard similarity matrices indicate that the CUIs comprising single resource-based PheMAPs vary in their similarity to one another. Single resource-based PheMAPs generated from MedlinePlus and MedicineNet outperformed the others but cover only 81.6% of overall disease phenotypes. We propose PheMAP-Ensemble, which provides higher average accuracy and precision than the combined average accuracy and precision of the single resource-based PheMAPs. While offering complete phenotype coverage, PheMAP-Ensemble also significantly increases phenotyping recall compared with the original PheMAP.
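To illustrate how an agreement-based ensemble trades precision against recall, here is a minimal sketch. The abstract does not state the exact agreement rule used by PheMAP-Ensemble, so the `min_agree` threshold, the function names, and the toy votes and labels below are all assumptions for illustration only.

```python
def ensemble_label(votes, min_agree):
    """Label a patient a case when at least `min_agree` resources vote 'case'.

    `votes` holds one boolean per resource (True = case). The threshold rule
    is an assumed stand-in for the study's agreement criterion.
    """
    return sum(votes) >= min_agree


def precision_recall(predicted, truth):
    """Precision and recall of boolean predictions against reference labels."""
    tp = sum(p and t for p, t in zip(predicted, truth))
    fp = sum(p and not t for p, t in zip(predicted, truth))
    fn = sum(t and not p for p, t in zip(predicted, truth))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical per-resource votes for four patients (not study data).
votes_per_patient = [
    [True, True, False],
    [True, False, False],
    [False, False, False],
    [True, True, True],
]
# Hypothetical reference labels, e.g. from a rule-based algorithm.
truth = [True, True, False, True]

strict = [ensemble_label(v, min_agree=2) for v in votes_per_patient]
lenient = [ensemble_label(v, min_agree=1) for v in votes_per_patient]

print("min_agree=2:", precision_recall(strict, truth))
print("min_agree=1:", precision_recall(lenient, truth))
```

In this toy data, lowering the agreement threshold recovers the case that only one resource flags, which raises recall; on real data a looser threshold would typically also admit more false positives.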
The resources that make up PheMAP yield different phenotyping performance when used individually. The ensemble method significantly improves the quality of PheMAP by fully utilizing dissimilar resources to capture accurate phenotyping data from EHRs.