Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA.
Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA; The Second Clinical College Guangzhou University of Chinese Medicine, China.
J Biomed Inform. 2019 Aug;96:103246. doi: 10.1016/j.jbi.2019.103246. Epub 2019 Jun 27.
In precision medicine, deep phenotyping is defined as the precise and comprehensive analysis of phenotypic abnormalities, with the aim of better understanding the natural history of a disease and its genotype-phenotype associations. Detecting phenotypic relevance is an important task when translating precision medicine into clinical practice, especially for patient stratification based on deep phenotyping. In our previous work, we developed node embeddings for the Human Phenotype Ontology (HPO) to support phenotypic relevance measurement with distributed semantic representations. However, the derived HPO embeddings capture only the IS-A relationships among nodes, which limits how fully the graph can be explored.
In this study, we developed a framework, HPO2Vec+, to enrich the produced HPO embeddings with heterogeneous knowledge resources (i.e., DECIPHER, OMIM, and Orphanet) for detecting phenotypic relevance. Specifically, we parsed the disease-phenotype associations contained in these three resources to add non-inheritance relationships among phenotypic nodes in the HPO. To generate node embeddings for the HPO, we applied node2vec, which samples nodes from the enriched HPO graphs by random walks and then performs feature learning over the sampled nodes. Four HPO embeddings were generated from different graph structures, which we hereafter label HPOEmb-Original, HPOEmb-DECIPHER, HPOEmb-OMIM, and HPOEmb-Orphanet. We evaluated the derived embeddings quantitatively through an HPO link prediction task with four edge embedding operations and six machine learning algorithms. The best resulting embeddings were then evaluated for patient stratification of 10 rare diseases using electronic health records (EHR) collected at Mayo Clinic. We assessed our framework qualitatively by visualizing phenotypic clusters and by conducting a use case study on primary hyperoxaluria (PH), a rare disease, in which the task was to infer relevant phenotypes given 22 annotated PH-related phenotypes.
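The graph enrichment and sampling steps described above can be illustrated with a minimal sketch. It rests on several assumptions not stated in the abstract: two phenotypes are linked when they are annotated to the same disease, the walk is uniform (node2vec with its return and in-out parameters both set to 1), and the four edge operations are the ones commonly paired with node2vec (average, Hadamard, weighted L1, weighted L2). The function names and node labels are hypothetical.

```python
import random
from itertools import combinations


def enrich_graph(is_a_edges, disease_to_phenotypes):
    """Build an undirected adjacency map from IS-A edges plus
    co-annotation edges (assumed enrichment rule, see lead-in)."""
    adj = {}

    def add_edge(u, v):
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Inheritance (IS-A) edges from the ontology itself.
    for child, parent in is_a_edges:
        add_edge(child, parent)

    # Non-inheritance edges: link phenotypes annotated to the same disease.
    for phenotypes in disease_to_phenotypes.values():
        for u, v in combinations(sorted(set(phenotypes)), 2):
            add_edge(u, v)
    return adj


def random_walk(adj, start, length, rng):
    """Uniform random walk; a simplification of node2vec's biased walk."""
    walk = [start]
    while len(walk) < length:
        neighbors = sorted(adj[walk[-1]])
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk


# Binary operators that turn two node vectors into one edge vector
# (assumed set; the abstract does not name the four operations).
EDGE_OPERATORS = {
    "average": lambda a, b: [(x + y) / 2 for x, y in zip(a, b)],
    "hadamard": lambda a, b: [x * y for x, y in zip(a, b)],
    "weighted_l1": lambda a, b: [abs(x - y) for x, y in zip(a, b)],
    "weighted_l2": lambda a, b: [(x - y) ** 2 for x, y in zip(a, b)],
}
```

In such a setup, the sampled walks would be passed to a skip-gram style model to learn node vectors, and the edge vectors produced by one of the operators would serve as input features for a link prediction classifier.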
The quantitative link prediction task shows that HPOEmb-Orphanet achieved an optimal AUROC of 0.92, an average precision of 0.94, and an optimal F1 score of 0.86. The quantitative patient similarity measurement task indicates that HPOEmb-Orphanet achieved the highest average detection rate for similar patients over the 10 rare diseases and performed better than the other similarity measures implemented by an existing tool, HPOSim, especially for patient pairs that share fewer phenotypes. The qualitative evaluation shows that the enriched HPO embeddings are generally able to detect relationships among nodes at fine granularity and that HPOEmb-Orphanet is particularly good at associating phenotypes across different disease systems. For the use case of detecting relevant phenotypic characterizations for given PH-related phenotypes, HPOEmb-Orphanet outperformed the other three HPO embeddings, achieving the highest average P@5 of 0.81 and the highest P@10 of 0.79. Compared with the seven conventional similarity measures provided by HPOSim, HPOEmb-Orphanet is able to detect more relevant phenotypic pairs, especially pairs that are not in inheritance relationships.
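For readers unfamiliar with the P@5 and P@10 figures reported above, the following sketch shows how precision at k can be computed for one query phenotype. It assumes that candidates are ranked by cosine similarity between embedding vectors and that relevance labels come from expert annotation; the abstract does not specify the similarity function, and all names here are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_candidates(query_vec, candidates):
    """Return candidate names sorted by descending similarity to the query.

    `candidates` maps a phenotype name to its embedding vector.
    """
    return sorted(
        candidates,
        key=lambda name: cosine(query_vec, candidates[name]),
        reverse=True,
    )


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k
```

An average P@k, as reported in the use case, would then be the mean of `precision_at_k` over all query phenotypes.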
We drew the following conclusions from the evaluation results. First, with additional non-inheritance edges, enriched HPO embeddings can detect more associations between fine-granularity phenotypic nodes regardless of their topological structures in the HPO graph. Second, HPOEmb-Orphanet not only achieves the best performance in link prediction and in patient stratification based on phenotypic similarity, but also detects relevant phenotypes that are closer to domain experts' judgments than those detected by the other embeddings and by conventional similarity measures. Third, incorporating heterogeneous knowledge resources does not necessarily result in better performance for detecting relevant phenotypes. From a clinical perspective, in our use case study, clinically oriented knowledge resources (e.g., Orphanet) achieved better performance in detecting relevant phenotypic characterizations than biomedically oriented knowledge resources (e.g., DECIPHER and OMIM).