Language Technology Lab, Department of Theoretical and Applied Linguistics, University of Cambridge, Cambridge, UK.
Institute of Environmental Medicine, Karolinska Institutet, Stockholm, Sweden.
Bioinformatics. 2019 May 1;35(9):1553-1561. doi: 10.1093/bioinformatics/bty845.
The overwhelming size and rapid growth of the biomedical literature make it impossible for scientists to read all studies related to their work, potentially leading to missed connections and wasted time and resources. Literature-based discovery (LBD) aims to alleviate these issues by identifying implicit links between disjoint parts of the literature. While LBD has been studied in depth since its introduction three decades ago, limited work has applied recent advances in biomedical text processing methods to it.
We present LION LBD, a literature-based discovery system that enables researchers to navigate published information and supports hypothesis generation and testing. The system is built with a particular focus on the molecular biology of cancer and uses state-of-the-art machine learning and natural language processing methods. These include named entity recognition and grounding to domain ontologies covering a wide range of entity types, and a novel approach to detecting references to the hallmarks of cancer in text. LION LBD implements a broad selection of co-occurrence based metrics for analyzing the strength of entity associations. Its design allows real-time search for indirect associations between entities in a database of tens of millions of publications, while preserving the ability of users to explore each mention in its original context in the literature. Evaluations of the system demonstrate its ability to identify undiscovered links and to rank relevant concepts highly among potential connections.
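To make the idea of co-occurrence based scoring and indirect associations concrete, the following is a minimal Python sketch of open discovery over a toy corpus. It is not the LION LBD implementation: the function names (`build_cooccurrence`, `jaccard`, `open_discovery`), the choice of the Jaccard index as the metric, the use of the weaker of the two links as the path score, and the placeholder entities `A`, `B1`, `C1` and so on are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence(docs):
    """Count entity and entity-pair document frequencies.

    docs: iterable of sets of entity identifiers, one set per document.
    """
    entity_count = defaultdict(int)
    pair_count = defaultdict(int)
    for entities in docs:
        for entity in entities:
            entity_count[entity] += 1
        # Store each unordered pair once, with a canonical (sorted) key.
        for a, b in combinations(sorted(entities), 2):
            pair_count[(a, b)] += 1
    return entity_count, pair_count


def jaccard(a, b, entity_count, pair_count):
    """Jaccard index of the document sets mentioning a and b."""
    key = (a, b) if a < b else (b, a)
    both = pair_count.get(key, 0)
    union = entity_count[a] + entity_count[b] - both
    return both / union if union else 0.0


def open_discovery(start, entity_count, pair_count):
    """Rank entities linked to `start` only through an intermediate entity.

    Assumption: a path start-B-C is scored by its weaker link, and a
    candidate C keeps the score of its best path.
    """
    neighbours = defaultdict(set)
    for a, b in pair_count:
        neighbours[a].add(b)
        neighbours[b].add(a)

    direct = neighbours[start]
    scores = {}
    for b in direct:
        ab = jaccard(start, b, entity_count, pair_count)
        for c in neighbours[b]:
            # Skip the start entity and anything it already co-occurs with.
            if c == start or c in direct:
                continue
            bc = jaccard(b, c, entity_count, pair_count)
            scores[c] = max(scores.get(c, 0.0), min(ab, bc))
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy corpus with placeholder entities (hypothetical, not real data).
toy_docs = [
    {"A", "B1"},
    {"B1", "C1"},
    {"A", "B2"},
    {"B2", "C2"},
    {"B2", "C3"},
]
entity_count, pair_count = build_cooccurrence(toy_docs)
ranking = open_discovery("A", entity_count, pair_count)
```

In this sketch, `ranking` lists only entities that never co-occur with `A` directly, ordered by the strength of their best indirect path. Other co-occurrence metrics could be substituted for `jaccard` without changing the surrounding logic.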
The LION LBD system is available via a web-based user interface and a programmable API, and all components of the system are made available under open licenses from the project home page http://lbd.lionproject.net.
Supplementary data are available at Bioinformatics online.