Shanghai Shuguang Hospital Affiliated to Shanghai University of Traditional Chinese Medicine, Pu'an Road, Shanghai, China.
Shanghai Leyan Technologies Co. Ltd, No. 1028 Panyu Road, Shanghai, China.
BMC Med Inform Decis Mak. 2019 Apr 9;19(Suppl 2):49. doi: 10.1186/s12911-019-0771-6.
Diabetes has become one of the hot topics in life science research. To support their analyses, researchers and analysts spend considerable labor collecting experimental data, a process that is also error-prone. To reduce this cost and ensure data quality, there is a growing trend toward extracting clinical events, in the form of knowledge, from electronic medical records (EMRs). Doing so first requires a high-coverage knowledge base (KB) for a specific disease to support these extraction tasks, which we call KB-based extraction.
We propose an approach to building a diabetes-centric knowledge base (DKB) by mining the Web. Specifically, we first extract knowledge from the semi-structured content of vertical portals, fuse the knowledge extracted from each site, and then map it to a unified KB. The target DKB is then extracted from this overall KB using a distance-based Expectation-Maximization (EM) algorithm.
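The final step, selecting a disease-centric subset from the unified KB, can be illustrated with a minimal sketch. The abstract does not specify the distance measure or the mixture model, so everything here is an assumption made for illustration: distance is taken to be the hop count from a seed entity, and EM fits a two-component Gaussian mixture with a shared variance over those distances, keeping the entities that more likely belong to the "near" component. The function names and the toy edges are hypothetical and are not taken from the DKB.

```python
import math
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (head, tail) pairs."""
    graph = {}
    for head, tail in edges:
        graph.setdefault(head, set()).add(tail)
        graph.setdefault(tail, set()).add(head)
    return graph


def hop_distances(graph, seed):
    """Breadth-first hop count from the seed to every reachable node."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def em_select(distances, iterations=100, min_var=0.05):
    """Assumed model: two Gaussians with a shared variance over hop distance.

    Returns the nodes whose posterior for the near component exceeds 0.5.
    """
    nodes = list(distances)
    d = [float(distances[node]) for node in nodes]
    n = len(d)
    m0, m1 = min(d), max(d)  # near / far component means
    var, w0 = 1.0, 0.5       # shared variance, near-component weight
    resp = [0.5] * n
    for _ in range(iterations):
        # E-step: posterior of the near component; with a shared variance
        # the Gaussian normalizers cancel, leaving a logistic function.
        prior = math.log(w0 / (1.0 - w0))
        resp = []
        for x in d:
            logit = prior + ((x - m1) ** 2 - (x - m0) ** 2) / (2.0 * var)
            logit = max(-50.0, min(50.0, logit))  # avoid overflow in exp
            resp.append(1.0 / (1.0 + math.exp(-logit)))
        # M-step: re-estimate means, shared variance, and mixing weight.
        r0 = sum(resp)
        r1 = n - r0
        if r0 < 1e-9 or r1 < 1e-9:
            break  # one component vanished; keep the last estimates
        m0 = sum(r * x for r, x in zip(resp, d)) / r0
        m1 = sum((1.0 - r) * x for r, x in zip(resp, d)) / r1
        spread = sum(r * (x - m0) ** 2 + (1.0 - r) * (x - m1) ** 2
                     for r, x in zip(resp, d))
        var = max(min_var, spread / n)
        w0 = r0 / n
    return {node for node, r in zip(nodes, resp) if r > 0.5}


# Invented toy edges for illustration only.
TOY_EDGES = [
    ("diabetes", "insulin"), ("diabetes", "metformin"),
    ("diabetes", "polyuria"), ("diabetes", "polydipsia"),
    ("diabetes", "retinopathy"), ("insulin", "hypoglycemia"),
    ("retinopathy", "blurred vision"), ("hypoglycemia", "dizziness"),
    ("dizziness", "vertigo"), ("vertigo", "otitis"),
    ("otitis", "earache"), ("otitis", "tinnitus"),
    ("otitis", "hearing loss"),
]

selected = em_select(hop_distances(build_graph(TOY_EDGES), "diabetes"))
print(sorted(selected))
```

Sharing one variance between the two components is a deliberate simplification: it makes the posterior monotone in distance, so the selection behaves like a learned distance cutoff rather than a hand-picked one.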
In the experiments, we selected eight popular vertical portals in China as data sources for constructing the DKB. The final diabetes KB contains 7703 instances and 96,041 edges covering diseases, symptoms, western medicines, traditional Chinese medicines, examinations, departments, and body structures. The accuracy of the DKB is 95.91%. In addition to assessing the quality of the knowledge extracted from the vertical portals, we carried out detailed experiments to evaluate knowledge fusion performance and the convergence of the distance-based EM algorithm, with positive results.
In this paper, we introduced an approach to constructing a DKB. A knowledge extraction and fusion pipeline was first used to extract semi-structured data from vertical portals, and the individual KBs were then fused into a unified knowledge base. We then developed a distance-based Expectation-Maximization algorithm to extract a subset of the overall knowledge base that forms the target DKB. Experiments showed that the data in the DKB are rich and of high quality.