Garcelon Nicolas, Neuraz Antoine, Benoit Vincent, Salomon Rémi, Kracker Sven, Suarez Felipe, Bahi-Buisson Nadia, Hadj-Rabia Smail, Fischer Alain, Munnich Arnold, Burgun Anita
Institut Imagine, Université Paris Descartes-Sorbonne Paris Cité, Paris, France; INSERM, Institut Imagine, UMR 1163, Université Paris Descartes, Sorbonne Paris Cité, Paris, France; INSERM, Centre de Recherche des Cordeliers, UMR 1138 Equipe 22, Université Paris Descartes, Sorbonne Paris Cité, Paris, France.
INSERM, Centre de Recherche des Cordeliers, UMR 1138 Equipe 22, Université Paris Descartes, Sorbonne Paris Cité, Paris, France; Département d'informatique médicale, Hôpital Necker-Enfants Malades, Assistance Publique-Hôpitaux de Paris (AP-HP), Université Paris Descartes, Sorbonne Paris Cité, France.
J Biomed Inform. 2017 Sep;73:51-61. doi: 10.1016/j.jbi.2017.07.016. Epub 2017 Jul 25.
In the context of rare diseases, it may be helpful to use automated methods to detect patients with similar medical histories, diagnoses and outcomes among a large number of cases. To reduce the time needed to find new cases, we developed a method that, given an index case, finds similar patients using data from electronic health records.
We used the clinical data warehouse of a pediatric academic hospital in Paris, France (Necker-Enfants Malades), containing about 400,000 patients. Our model was based on a vector space model (VSM) that computes the similarity between an index patient and every other patient in the data warehouse. The dimensions of the VSM were built from Unified Medical Language System (UMLS) concepts extracted from the clinical narratives stored in the data warehouse. The VSM was enhanced using three parameters: a pertinence score (the TF-IDF of each concept), the polarity of the concept (negated or not negated) and the minimum number of concepts in common. We evaluated the model on the most similar patients returned for five rare diseases: Lowe Syndrome (LOWE), Dystrophic Epidermolysis Bullosa (DEB), Activated PI3K delta Syndrome (APDS), Rett Syndrome (RETT) and Dowling-Meara epidermolysis bullosa simplex (EBS-DM), represented in the clinical data warehouse by 18, 103, 21, 84 and 7 patients respectively.
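The retrieval step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each patient is already reduced to a list of (concept identifier, negated) pairs, that polarity is handled by treating a negated concept as a separate dimension, and that similarity is the cosine of TF-IDF vectors; the function names, the `min_common` default and the concept identifiers used when calling it are all hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(patients):
    """Build one sparse TF-IDF vector per patient.

    patients: {patient_id: [(concept_id, negated), ...]}
    Assumption: a (concept, polarity) pair is one dimension, so a negated
    concept never matches the same concept when it is affirmed.
    """
    n = len(patients)
    doc_freq = Counter()
    for concepts in patients.values():
        doc_freq.update(set(concepts))
    vectors = {}
    for pid, concepts in patients.items():
        term_freq = Counter(concepts)
        vectors[pid] = {
            c: (count / len(concepts)) * math.log(n / doc_freq[c])
            for c, count in term_freq.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_similar(index_id, patients, k=30, min_common=2):
    """Return up to k patient ids ranked by similarity to the index patient.

    Candidates sharing fewer than min_common dimensions with the index
    patient are discarded before ranking (hypothetical threshold).
    """
    vectors = tfidf_vectors(patients)
    index_vec = vectors[index_id]
    scored = []
    for pid, vec in vectors.items():
        if pid == index_id:
            continue
        if len(index_vec.keys() & vec.keys()) < min_common:
            continue
        scored.append((cosine(index_vec, vec), pid))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [pid for _, pid in scored[:k]]
```

Calling `top_similar("p1", patients)` on a dictionary of annotated patients returns the ranked list for index patient `p1`. Encoding polarity in the dimension itself is only one possible design; the source does not say how negation enters the similarity computation.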
The percentages of index patients for whom at least one true positive appeared among the Top30 similar patients were 94% for LOWE, 97% for DEB, 86% for APDS, 71% for EBS-DM and 99% for RETT. On average, 51% of the 30 returned patients had exactly the same genetic disease as the index patient.
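The two figures reported here can be computed as in the following sketch, which assumes that a ranked list of returned patients is available for each index patient and that a reference diagnosis is known for the patients concerned. The function name and the data layout are hypothetical, and the authors' exact evaluation procedure may differ.

```python
def evaluate_top_k(rankings, diagnosis, k=30):
    """Compute the hit rate and the mean precision over the top k results.

    rankings:  {index_patient_id: [returned patient ids, best first]}
    diagnosis: {patient_id: disease label}
    Returns (share of index patients with at least one true positive in
    the top k, mean share of true positives among the returned patients).
    """
    hits = 0
    precisions = []
    for index_id, ranked in rankings.items():
        top = ranked[:k]
        true_pos = sum(
            1 for pid in top if diagnosis.get(pid) == diagnosis[index_id]
        )
        if true_pos > 0:
            hits += 1
        # Divide by the number actually returned, which may be below k.
        precisions.append(true_pos / len(top) if top else 0.0)
    n = len(rankings)
    return hits / n, sum(precisions) / n
```

A true positive is counted here when a returned patient carries the same disease label as the index patient; patients without a known label are treated as non-matching.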
This tool offers new perspectives in a translational context to identify patients for genetic research. Moreover, when new molecular bases are discovered, our strategy will help to identify additional eligible patients for genetic screening.