School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China.
National Engineering Lab for Big Data Analytics, Xi'an Jiaotong University, Xi'an, China.
J Med Internet Res. 2021 Aug 4;23(8):e25670. doi: 10.2196/25670.
Genealogical information, such as that found in family trees, is imperative for biomedical research such as disease heritability and risk prediction. Researchers have used information on policyholders and their dependents in medical claims data, and emergency contacts in electronic health records (EHRs), to infer family relationships at a large scale. We have previously demonstrated that online obituaries can be a novel data source for building more complete and accurate family trees.
To supplement EHR data with family relationships for biomedical research, we built an end-to-end information extraction system that uses a multitask-based artificial neural network model to construct genealogical knowledge graphs (GKGs) from online obituaries. GKGs are enriched family trees with detailed information including age, gender, death and birth dates, and residence.
Building on a predefined family relationship map consisting of 4 types of entities (eg, people's name, residence, birth date, and death date) and 71 types of relationships, we curated a corpus containing 1700 online obituaries from the metropolitan area of Minneapolis and St Paul in Minnesota. We also adopted data augmentation techniques to generate additional synthetic data to alleviate data scarcity for rare family relationships. A multitask-based artificial neural network model was then built to simultaneously detect names, extract relationships between them, and assign attributes (eg, birth dates and death dates, residence, age, and gender) to each individual. Finally, we assembled related GKGs into larger ones by identifying people who appear in multiple obituaries.
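The final assembly step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each GKG is reduced to a set of person keys, and that two mentions refer to the same person when their hypothetical (name, birth date) keys match exactly. The function name `merge_graphs` and the key format are illustrative only. Graphs that share at least one person are joined with a union-find structure.

```python
def merge_graphs(graphs):
    """Merge person sets that share at least one person key.

    `graphs` is a list of sets; each set holds hypothetical
    (name, birth_date) keys for the people in one obituary's graph.
    Exact key equality as the identity test is an assumption.
    """
    parent = list(range(len(graphs)))

    def find(i):
        # Follow parent links to the root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # person key -> index of the first graph containing it
    for idx, people in enumerate(graphs):
        for person in people:
            if person in first_seen:
                # Same person seen before: join the two graphs.
                parent[find(idx)] = find(first_seen[person])
            else:
                first_seen[person] = idx

    merged = {}
    for idx, people in enumerate(graphs):
        merged.setdefault(find(idx), set()).update(people)
    return list(merged.values())


# Made-up example data, not taken from the corpus.
example = [
    {("Ann Lee", "1950-01-02"), ("Bob Lee", "1975-03-04")},
    {("Bob Lee", "1975-03-04"), ("Cara Lee", "2001-05-06")},
    {("Dan Roe", "1940-07-08")},
]
assembled = merge_graphs(example)
```

In this example the first two graphs share one person and collapse into a single three-person graph, while the third stays separate. A real system would need a more tolerant identity test, since names and dates in obituaries vary in spelling and completeness.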
Our system achieved satisfactory precision (94.79%), recall (91.45%), and F1 measure (93.09%) in 10-fold cross-validation. We also constructed 12,407 GKGs, with the largest one made up of 4 generations and 30 people.
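As a quick consistency check, the F1 measure is the harmonic mean of precision and recall, and the reported figures agree with that definition. The snippet below is only a sketch of the arithmetic; the helper name `f1_score` is illustrative, and because the reported values are cross-validation results, an exact match to the harmonic mean of the reported precision and recall is not guaranteed in general.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# Reported precision and recall, in percent.
f1 = f1_score(94.79, 91.45)
print(round(f1, 2))  # prints 93.09
```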
In this work, we discussed the meaning of GKGs for biomedical research, presented a new version of a corpus with a predefined family relationship map and augmented training data, and proposed a multitask deep neural system to construct and assemble GKGs. The results show that our system can extract family relationships from obituaries and demonstrate the potential of enriching EHR data for further genetic research. We share the source code and system with the scientific community on GitHub; the corpus is withheld to protect privacy.