Construction of Genealogical Knowledge Graphs From Obituaries: Multitask Neural Network Extraction System.

Affiliations

School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China.

National Engineering Lab for Big Data Analytics, Xi'an Jiaotong University, Xi'an, China.

Publication information

J Med Internet Res. 2021 Aug 4;23(8):e25670. doi: 10.2196/25670.

DOI:10.2196/25670

PMID:34346903

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8374669/

Abstract

BACKGROUND

Genealogical information, such as that found in family trees, is essential for biomedical research on topics such as disease heritability and risk prediction. Researchers have used information on policyholders and their dependents in medical claims data, as well as emergency contacts in electronic health records (EHRs), to infer family relationships at a large scale. We have previously demonstrated that online obituaries can be a novel data source for building more complete and accurate family trees.

OBJECTIVE

To supplement EHR data with family relationships for biomedical research, we built an end-to-end information extraction system that uses a multitask artificial neural network model to construct genealogical knowledge graphs (GKGs) from online obituaries. GKGs are family trees enriched with detailed information, including age, gender, birth and death dates, and residence.
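As an illustration of the GKG definition above, the following is a minimal sketch of how such an enriched family tree might be represented. The class names, field names, example people, and the relation label `mother_of` are all hypothetical; the abstract does not specify the authors' schema or list the 71 relationship types.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Person:
    """One node of a GKG. Every attribute except the name may be unknown."""
    name: str
    age: Optional[int] = None
    gender: Optional[str] = None
    birth_date: Optional[str] = None
    death_date: Optional[str] = None
    residence: Optional[str] = None


@dataclass
class GKG:
    """A family tree stored as people plus (subject, relation, object) triples."""
    people: Dict[str, Person] = field(default_factory=dict)
    relations: Set[Tuple[str, str, str]] = field(default_factory=set)

    def add_person(self, person: Person) -> None:
        self.people[person.name] = person

    def add_relation(self, subject: str, relation: str, obj: str) -> None:
        # Both endpoints must already be nodes of the graph.
        if subject not in self.people or obj not in self.people:
            raise KeyError("both people must be added before relating them")
        self.relations.add((subject, relation, obj))

    def relatives(self, name: str) -> List[Tuple[str, str]]:
        """Return sorted (relation, other person) pairs where `name` is the subject."""
        return sorted((rel, obj) for subj, rel, obj in self.relations if subj == name)


# Hypothetical example data, not taken from the corpus.
family = GKG()
family.add_person(Person(name="Jane Doe", age=87, residence="St Paul"))
family.add_person(Person(name="John Doe"))
family.add_relation("Jane Doe", "mother_of", "John Doe")
```

Storing relationships as triples keeps the structure close to a conventional knowledge graph, so attributes and edges can be added independently as they are extracted.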

METHODS

Building on a predefined family relationship map consisting of 4 types of entities (eg, people's names, residences, birth dates, and death dates) and 71 types of relationships, we curated a corpus of 1700 online obituaries from the Minneapolis-St Paul metropolitan area in Minnesota. We also used data augmentation to generate additional synthetic data and alleviate data scarcity for rare family relationships. A multitask artificial neural network model was then built to simultaneously detect names, extract relationships between them, and assign attributes (eg, birth and death dates, residence, age, and gender) to each individual. Finally, we assembled related GKGs into larger ones by identifying people who appear in multiple obituaries.
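The final assembly step, merging GKGs that share a person, can be sketched with a union-find structure. This is only an illustration under stated assumptions: each per-obituary graph is reduced to a set of person keys, and two mentions are treated as the same person when their keys are equal. The key format (name plus birth year), the function name `assemble`, and the example data are hypothetical; the abstract does not describe the authors' actual matching rule.

```python
from collections import defaultdict
from typing import Hashable, List, Set


def assemble(graphs: List[Set[Hashable]]) -> List[Set[Hashable]]:
    """Merge per-obituary person sets that share at least one person key."""
    parent = list(range(len(graphs)))

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # person key -> index of the first graph that contained it
    for idx, people in enumerate(graphs):
        for key in people:
            if key in first_seen:
                # The same person appears in an earlier graph: link the two graphs.
                parent[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx

    merged = defaultdict(set)
    for idx, people in enumerate(graphs):
        merged[find(idx)] |= people
    return list(merged.values())


# Hypothetical person keys: (name, birth year).
obituaries = [
    {("Jane Doe", 1933), ("John Doe", 1958)},
    {("Mary Roe", 1940), ("Tom Roe", 1965)},
    {("John Doe", 1958), ("Amy Doe", 1985)},
]
families = assemble(obituaries)
```

Here the first and third obituaries both mention the same key for John Doe, so they end up in one merged family, while the second stays separate. In practice, exact key equality would likely be too strict or too loose, and a real system would need a more careful identity-matching step.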

RESULTS

Our system achieved a precision of 94.79%, a recall of 91.45%, and an F1 measure of 93.09% in 10-fold cross-validation. We also constructed 12,407 GKGs, the largest of which spans 4 generations and 30 people.
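The reported F1 measure is consistent with the harmonic mean of the reported precision and recall, as the short check below shows. The function name `f1_score` is an illustrative choice, and the check assumes the F1 value can be recomputed from the aggregate precision and recall; scores averaged per fold would not necessarily match exactly.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(94.79, 91.45)
print(round(f1, 2))  # prints 93.09
```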

CONCLUSIONS

In this work, we discussed the significance of GKGs for biomedical research, presented a new version of a corpus with a predefined family relationship map and augmented training data, and proposed a multitask deep neural system to construct and assemble GKGs. The results show that our system can extract GKGs from obituaries and demonstrate the potential of enriching EHR data for further genetic research. We share the source code and system with the scientific community on GitHub; the corpus is not included, to protect privacy.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a6fa/8374669/6f26ed4d45eb/jmir_v23i8e25670_fig1.jpg
