Automatic categorization of diverse experimental information in the bioscience literature.

Affiliation

Howard Hughes Medical Institute and Biology Division, California Institute of Technology, Pasadena, CA 91125, USA.

Publication information

BMC Bioinformatics. 2012 Jan 26;13:16. doi: 10.1186/1471-2105-13-16.

DOI:10.1186/1471-2105-13-16

PMID:22280404

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3305665/

Abstract

BACKGROUND

Curation of information from the bioscience literature into biological knowledge databases is a crucial way of capturing experimental information in a computable form. During the biocuration process, a critical first step is to identify, from all published literature, the papers that contain results for the specific data type the curator is interested in annotating. This step normally requires curators to manually examine many papers to ascertain which few contain information of interest, and is therefore usually time-consuming. We developed an automatic method, based on the Support Vector Machine (SVM) machine learning method, for identifying papers containing these curation data types among a large pool of published scientific papers. This classification system is completely automatic and can be readily applied to diverse experimental data types. It has been used in production for automatic categorization of 10 different experimental data types in the biocuration process at WormBase for the past two years, and it is in the process of being adopted in the biocuration process at FlyBase and the Saccharomyces Genome Database (SGD). We anticipate that this method can be readily adopted by various databases in the biocuration community, thereby greatly reducing the time spent on an otherwise laborious and demanding task. We also developed a simple, readily automated procedure to utilize training papers of similar data types from different bodies of literature, such as C. elegans and D. melanogaster, to identify papers with any of these data types for a single database. This approach is significant because for some data types, especially those of low occurrence, a single corpus often does not have enough training papers to achieve satisfactory performance.
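As a rough illustration of the idea described above, the following Python sketch trains a linear SVM to flag papers that contain one data type (RNAi), pooling training papers from two corpora. It is not the authors' implementation: the abstract does not specify the features, kernel, or training procedure, so the bag-of-words features, the bias-free hinge-loss subgradient descent, the toy sentences, and all function names here are assumptions made for illustration only.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a document and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def featurize(text):
    """Turn a document into an L2-normalised bag-of-words vector (a dict)."""
    counts = Counter(tokenize(text))
    norm = sum(c * c for c in counts.values()) ** 0.5 or 1.0
    return {word: c / norm for word, c in counts.items()}


def decision_value(weights, text):
    """Signed score w.x of a document under the linear model (no bias term)."""
    return sum(weights.get(w, 0.0) * v for w, v in featurize(text).items())


def train_linear_svm(papers, lam=0.01, eta=0.1, epochs=100):
    """Fit a bias-free linear SVM by subgradient descent on the hinge loss.

    `papers` is a list of (text, label) pairs, where label is +1 if the paper
    contains the data type and -1 otherwise. The hyperparameters are
    illustrative assumptions, not values taken from the paper.
    """
    weights = {}
    examples = [(featurize(text), label) for text, label in papers]
    for _ in range(epochs):
        for x, y in examples:  # fixed order keeps the sketch deterministic
            margin = y * sum(weights.get(w, 0.0) * v for w, v in x.items())
            # Shrink all weights: the gradient step for the L2 regulariser.
            for w in weights:
                weights[w] *= 1.0 - eta * lam
            # The hinge loss contributes only when the margin is below 1.
            if margin < 1.0:
                for w, v in x.items():
                    weights[w] = weights.get(w, 0.0) + eta * y * v
    return weights


# Hypothetical toy training papers from two bodies of literature.
worm_papers = [
    ("RNAi knockdown of the gene by feeding caused embryonic lethality", 1),
    ("We solved the crystal structure of the purified protein", -1),
    ("dsRNA injection for RNAi produced a sterile phenotype", 1),
    ("Phylogenetic analysis of nematode genome assemblies", -1),
]
fly_papers = [
    ("Transgenic RNAi lines enabled knockdown in the wing disc", 1),
    ("Population genetics of wild strains", -1),
]

# Pool both corpora so a sparsely represented data type gets more examples.
weights = train_linear_svm(worm_papers + fly_papers)

for new_paper in ("dsRNA feeding RNAi knockdown",
                  "crystal structure and phylogenetic analysis"):
    flagged = decision_value(weights, new_paper) > 0
    print(f"{new_paper!r}: flag for RNAi curation = {flagged}")
```

In practice, one such binary classifier would be trained per data type, and a production system would more likely rely on an established SVM library and full-text features than on a hand-written trainer like this one.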

RESULTS

We successfully tested the method on ten data types from WormBase, fifteen data types from FlyBase and three data types from Mouse Genome Informatics (MGI). It is being used in the curation workflow at WormBase for automatic association of newly published papers with ten data types: RNAi, antibody, phenotype, gene regulation, mutant allele sequence, gene expression, gene product interaction, overexpression phenotype, gene interaction, and gene structure correction.

CONCLUSIONS

Our methods are applicable to a variety of data types with training sets containing several hundred to a few thousand documents. They are completely automatic and can thus be readily incorporated into different workflows at different literature-based databases. We believe that the work presented here can contribute greatly to the tremendous task of automating the important yet labor-intensive biocuration effort.
