Sedler Andrew R, Mitchell Cassie S
Laboratory for Pathology Dynamics, Department of Biomedical Engineering, Georgia Institute of Technology, Emory University School of Medicine, Atlanta, GA, United States.
Front Bioeng Biotechnol. 2019 Jul 3;7:156. doi: 10.3389/fbioe.2019.00156. eCollection 2019.
Literature-Based Discovery (LBD) aims to connect scientists across silos by assembling models of the literature to reveal previously hidden connections. Unfortunately, LBD systems have been unable to achieve user adoption on a large scale. This work develops open-source software in Python to convert a database of semantic predications of all of PubMed's 27.9 million indexed abstracts into a semantic inference network and biomedical concept graph in Neo4j. The developed software, called SemNet, queries a modified version of the publicly available SemMedDB and computes feature vectors on source-target pairs. Each unique Unified Medical Language System (UMLS) concept is represented as a node and each predication as an edge. Each node is assigned one of 132 node labels (e.g., Amino Acid, Peptide, or Protein (AAPP); Gene or Genome (GG); etc.) and each edge is labeled with one of 58 predications (e.g., treats, causes, inhibits, etc.). SemNet computes a single feature value for each metapath, or sequence of node types, between a source node and user-specified target node(s). Several different types of metapath-based features (count, degree weighted path count, and HeteSim metric) are computed and vectorized. SemNet employs an unsupervised learning algorithm for rank aggregation (ULARA) to rank identified source nodes that are most relevant to the user-specified target node(s). Statistical analysis of correlation among identified source nodes or resultant literature network features is used to identify patterns that can guide future research. Analysis of high residual nodes is used to compare and contrast SemNet rankings between different targets of interest. An example SemNet use case is presented to assess "the differential impact of smoking on cognition in males and females" using the following target nodes: nicotine, learning, memory, tetrahydrocannabinol (THC), cigarette smoke, X chromosome, and Y chromosome. Detailed rankings are discussed.
Overall results suggest the hypothesis that smoking negatively impacts cognition to a greater extent in females, whereas it has stronger cardiovascular impacts in males. In summary, SemNet provides an adoptable method for efficient LBD of PubMed that extends beyond omics-only relationships to true multi-scalar connections that can provide actionable insight for predictive medicine, research prioritization, and clinical care.
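To make the metapath-based features more concrete, the following is a minimal sketch of two of them (path count and degree weighted path count) on a toy graph. It is not SemNet's code: the node names, the type labels, the `damping` exponent, and all function names are hypothetical, the graph is held in plain Python dictionaries rather than Neo4j, edges are treated as undirected, and each node's total degree stands in for the edge-type-specific degree that a fuller degree weighted path count would use.

```python
from collections import defaultdict

# Hypothetical toy data: node -> type label, and (subject, object) edges.
NODE_TYPES = {
    "nicotine": "Drug",
    "CHRNA4": "Gene",
    "BDNF": "Gene",
    "memory": "Function",
}
EDGES = [
    ("nicotine", "CHRNA4"),
    ("nicotine", "BDNF"),
    ("CHRNA4", "memory"),
    ("BDNF", "memory"),
    ("nicotine", "memory"),
]


def build_adjacency(edges):
    """Build an undirected adjacency map (a simplifying assumption)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def simple_paths(adj, source, target, max_len):
    """Yield every simple path from source to target with at most max_len edges."""
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target:
            yield path
            continue
        if len(path) - 1 >= max_len:
            continue
        for nxt in adj[node]:
            if nxt not in path:
                stack.append((nxt, path + [nxt]))


def metapath_features(adj, node_types, source, target, max_len=2, damping=0.5):
    """Return (count, degree-weighted count) per metapath for one source-target pair."""
    counts = defaultdict(int)
    dwpc = defaultdict(float)
    for path in simple_paths(adj, source, target, max_len):
        # A metapath is the sequence of node types along the path.
        metapath = tuple(node_types[n] for n in path)
        counts[metapath] += 1
        # Down-weight paths through high-degree nodes: each edge contributes
        # (deg(u) * deg(v)) ** -damping to the path's weight.
        weight = 1.0
        for u, v in zip(path, path[1:]):
            weight *= (len(adj[u]) * len(adj[v])) ** -damping
        dwpc[metapath] += weight
    return dict(counts), dict(dwpc)


if __name__ == "__main__":
    adjacency = build_adjacency(EDGES)
    counts, dwpc = metapath_features(adjacency, NODE_TYPES, "nicotine", "memory")
    for metapath in counts:
        print(" > ".join(metapath), counts[metapath], round(dwpc[metapath], 4))
```

In this toy graph the pair (nicotine, memory) is joined by one direct Drug > Function path and two Drug > Gene > Function paths, so the feature vector for the pair has one entry per metapath. Stacking such vectors across candidate source nodes gives the kind of input that a rank aggregation step could then consume.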