Islamaj Rezarta, Wei Chih-Hsuan, Lai Po-Ting, Huston Melanie, Coss Cathleen, Kochar Preeti Gokal, Miliaras Nicholas, Mork James G, Rodionov Oleg, Sekiya Keiko, Trinh Dorothy, Whitman Deborah, Wallin Craig, Lu Zhiyong
National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, United States.
JAMIA Open. 2025 Jan 7;8(1):ooae129. doi: 10.1093/jamiaopen/ooae129. eCollection 2025 Feb.
The National Library of Medicine (NLM) currently indexes close to a million articles each year from more than 5300 medical and life sciences journals. A significant number of these articles contain critical information about the structure, genetics, and function of genes and proteins in normal and disease states. NLM curators identify these articles and manually link each one to the corresponding gene records in the NCBI Gene database. The information is thus interconnected across NLM resources and services, which brings considerable value to the life sciences. The NLM aims to provide timely access to all metadata, which requires article indexing to scale with the volume of published literature. Although automatic information extraction methods have achieved accurate results in biomedical text mining research, it remains difficult to evaluate them on established pipelines and to integrate them into daily workflows.
Here, we demonstrate how our machine learning model, GNorm2, which achieved state-of-the-art performance in identifying genes and their corresponding species while handling innate textual ambiguities, could be integrated into the established daily workflow at the NLM and evaluated for its performance in this new environment.
We worked with 8 expert biomedical curators and evaluated the integration on four criteria: (1) gene identification accuracy, (2) interannotator agreement with and without GNorm2, (3) potential GNorm2 bias, and (4) indexing consistency and efficiency. We identified key interface changes that significantly helped the curators maximize the benefit of GNorm2, and, based on a survey of the expert biocurators, further improved the GNorm2 algorithm to cover genes from 135 species, including viral and bacterial genes.
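As an illustration of criteria (1) and (2), the sketch below shows one common way such measures can be computed over sets of gene identifiers assigned to a single article. It is a minimal example under stated assumptions, not the evaluation code used in this study: the function names `precision_recall_f1` and `jaccard_agreement`, the choice of Jaccard overlap as the agreement measure, and the sample identifiers are all hypothetical.

```python
# Hypothetical sketch: article-level scoring of gene ID sets.
# Not the study's evaluation code; names and sample data are illustrative.

def precision_recall_f1(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Score predicted gene IDs against a curator's gold-standard set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def jaccard_agreement(annotator_a: set[str], annotator_b: set[str]) -> float:
    """Overlap between two annotators' gene ID sets (assumed agreement measure)."""
    union = annotator_a | annotator_b
    if not union:
        return 1.0  # both annotators assigned no genes: treated as full agreement
    return len(annotator_a & annotator_b) / len(union)


if __name__ == "__main__":
    # Illustrative identifiers only; not taken from the study.
    model_ids = {"7157", "672", "1956"}
    curator_ids = {"7157", "672", "675"}
    print(precision_recall_f1(model_ids, curator_ids))
    print(jaccard_agreement(model_ids, curator_ids))
```

Comparing the agreement score for the same pair of curators with and without model suggestions gives a simple view of whether the tool makes indexing more consistent.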
GNorm2 is currently being fully integrated into the curators' regular workflow.