Assessing Artificial Intelligence (AI) Implementation for Assisting Gene Linking (at the National Library of Medicine).

Authors

Islamaj Rezarta, Wei Chih-Hsuan, Lai Po-Ting, Huston Melanie, Coss Cathleen, Kochar Preeti Gokal, Miliaras Nicholas, Mork James G, Rodionov Oleg, Sekiya Keiko, Trinh Dorothy, Whitman Deborah, Wallin Craig, Lu Zhiyong

Affiliations

National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, United States.

Publication information

JAMIA Open. 2025 Jan 7;8(1):ooae129. doi: 10.1093/jamiaopen/ooae129. eCollection 2025 Feb.

Abstract

OBJECTIVES

The National Library of Medicine (NLM) currently indexes close to a million articles each year from more than 5300 medical and life sciences journals. A significant number of these articles contain critical information about the structure, genetics, and function of genes and proteins in normal and disease states. NLM curators identify these articles and manually link each one to the corresponding gene records in the NCBI Gene database. The information is thereby interconnected with all NLM resources and services, which brings considerable value to the life sciences. The NLM aims to provide timely access to all metadata, which requires article indexing to scale with the volume of published literature. However, although automatic information extraction methods have been shown to achieve accurate results in biomedical text mining research, it remains difficult to evaluate them on established pipelines and to integrate them into daily workflows.

MATERIALS AND METHODS

Here, we demonstrate how our machine learning model, GNorm2, which achieved state-of-the-art performance in identifying genes and their corresponding species while handling innate textual ambiguities, could be integrated into the established daily workflow at the NLM and evaluated for its performance in this new environment.

RESULTS

We worked with 8 expert biomedical curators and evaluated the integration on four parameters: (1) gene identification accuracy, (2) interannotator agreement with and without GNorm2, (3) potential GNorm2 bias, and (4) indexing consistency and efficiency. We identified key interface changes that significantly helped the curators maximize the benefit of GNorm2, and, based on a survey of the expert biocurators, further improved the GNorm2 algorithm to cover genes from 135 species, including viral and bacterial genes.
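
To make the first two parameters concrete, the following is a minimal sketch of how gene identification accuracy and interannotator agreement might be scored for a single article. It is an illustration under stated assumptions, not the evaluation procedure used in the study: the abstract does not say which metrics were applied, so set-based precision/recall/F1 and mean pairwise Jaccard overlap are assumed here, and the function names and gene identifiers are placeholders.

```python
from itertools import combinations


def precision_recall_f1(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1 for the gene IDs linked to one article."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def mean_pairwise_jaccard(annotations: list[set[str]]) -> float:
    """Average Jaccard overlap over all pairs of curators for one article."""
    scores = []
    for a, b in combinations(annotations, 2):
        union = a | b
        # Two empty annotation sets are treated as full agreement.
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores) if scores else 1.0


# Illustrative gene identifiers only; these are not data from the study.
gold = {"7157", "672", "1956"}          # assumed curator-approved links
predicted = {"7157", "672", "999999"}   # assumed model output
p, r, f1 = precision_recall_f1(predicted, gold)

# Assumed annotations from three curators for the same article.
curators = [{"7157", "672"}, {"7157", "672", "1956"}, {"7157"}]
agreement = mean_pairwise_jaccard(curators)
```

Running the same agreement function on annotations produced with and without model suggestions would give one simple way to compare the two conditions; a chance-corrected statistic could be substituted if preferred.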

CONCLUSION

GNorm2 is currently being fully integrated into the curators' regular workflow.
