The University of Texas Health Science Center at Houston, USA.
Johns Hopkins University, USA.
Health Informatics J. 2020 Mar;26(1):21-33. doi: 10.1177/1460458219869490. Epub 2019 Sep 30.
Software tools are now essential to research and applications in the biomedical domain. However, existing software repositories are built mainly through manual curation, which is time-consuming and does not scale. In this study, we manually annotated software names in 1,120 MEDLINE abstracts and titles and used this corpus to develop and evaluate machine learning-based named entity recognition systems for biomedical software. Specifically, we proposed two feature engineering strategies: (1) domain knowledge features and (2) unsupervised word representation features derived from clustered and binarized word embeddings. Under inexact matching criteria, our best system achieved an F-measure of 91.79% for recognizing software in titles and an F-measure of 86.35% for recognizing software in both titles and abstracts. We then used the developed system to create a biomedical software catalog with 19,557 entries. This study demonstrates the feasibility of using natural language processing methods to automatically build a high-quality software index from biomedical literature.
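To make the evaluation criterion concrete, the following is a minimal sketch of how an F-measure under inexact matching might be computed. The abstract does not define its inexact matching rule, so this sketch assumes that a predicted mention counts as correct when its character span overlaps any annotated span. The names `Span`, `overlaps`, and `inexact_f_measure` are hypothetical and are not taken from the study.

```python
from typing import List, Tuple

# Hypothetical representation: (start, end) character offsets, end exclusive.
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """Return True when the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def inexact_f_measure(gold: List[Span], pred: List[Span]) -> Tuple[float, float, float]:
    """Compute (precision, recall, F-measure) under overlap-based matching.

    Assumption: a predicted span is correct if it overlaps any gold span,
    and a gold span is found if it overlaps any predicted span.
    """
    matched_pred = sum(any(overlaps(p, g) for g in gold) for p in pred)
    matched_gold = sum(any(overlaps(g, p) for p in pred) for g in gold)

    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Illustrative spans only; these are not data from the study.
gold_spans = [(0, 5), (10, 15)]
pred_spans = [(3, 7), (20, 25)]
precision, recall, f_measure = inexact_f_measure(gold_spans, pred_spans)
```

With exact matching, the first prediction above would count as an error because its boundaries differ from the annotation; under the overlap rule assumed here, it counts as a hit, which is why inexact scores are generally at least as high as exact ones.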