Towards classifying species in systems biology papers using text mining.

Authors

Wei Qi, Collier Nigel

Affiliation

Department of Informatics, The Graduate University for Advanced Studies (Sokendai), 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo.

Publication

BMC Res Notes. 2011 Feb 4;4:32. doi: 10.1186/1756-0500-4-32.

DOI: 10.1186/1756-0500-4-32
PMID: 21294879
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3045319/

Abstract

BACKGROUND

In recent years, high-throughput methods have led to a massive expansion in the free-text literature on molecular biology. Automated text mining has developed as an application technology for formalizing this wealth of published results into structured database entries. However, database curation is still largely done by hand, and although there have been many studies on automated approaches, problems remain in how to classify documents into top-level categories based on the type of organism being investigated. Here we present a comparative analysis of state-of-the-art supervised models that are used to classify both abstracts and full-text articles for three model organisms.

RESULTS

Ablation experiments were conducted on a large gold standard corpus of 10,000 abstracts and full papers containing data on three model organisms (fly, mouse and yeast). Among the eight learner models tested, the best model achieved an F-score of 97.1% for fly, 88.6% for mouse and 85.5% for yeast using a variety of features that included gene name, organism frequency, MeSH headings and term-species associations. We noted that term-species associations were particularly effective in improving classification performance. The benefit of using full text articles over abstracts was consistently observed across all three organisms.
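As a reading aid for the figures above: an F-score is conventionally the harmonic mean of precision and recall, computed separately for each class. The abstract does not say which F-score variant was used, so the sketch below assumes the balanced F1 and single-label predictions. The function name `per_class_f1` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter

ORGANISMS = ("fly", "mouse", "yeast")


def per_class_f1(gold, predicted, labels=ORGANISMS):
    """Return {label: F1} for single-label predictions.

    Assumes balanced F1 = 2*TP / (2*TP + FP + FN); the paper's exact
    variant is not stated in the abstract.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # true label g was missed

    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Toy data for illustration only (not from the paper's corpus).
gold = ["fly", "fly", "mouse", "yeast", "mouse", "yeast"]
pred = ["fly", "fly", "mouse", "mouse", "mouse", "yeast"]
scores = per_class_f1(gold, pred)
```

On this toy data one yeast paper is mislabeled as mouse, which lowers the mouse score through a false positive and the yeast score through a false negative, while fly is unaffected.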

CONCLUSIONS

By comparing various learner algorithms and features we presented an optimized system that automatically detects the major focus organism in full text articles for fly, mouse and yeast. We believe the method will be extensible to other organism types.
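To make the idea of an organism-frequency feature concrete, here is a naive sketch that counts organism mentions in a text and picks the most frequent one. This is not the authors' system, which combines several features in supervised learners. The lexicon `ORGANISM_TERMS` and both function names are hypothetical, and the term lists are deliberately tiny.

```python
import re
from collections import Counter

# Hypothetical, deliberately small lexicon; the term lists used in the
# paper are not given in the abstract.
ORGANISM_TERMS = {
    "fly": ("drosophila", "d. melanogaster", "fruit fly"),
    "mouse": ("mouse", "mice", "mus musculus"),
    "yeast": ("yeast", "saccharomyces", "s. cerevisiae"),
}


def organism_frequencies(text):
    """Count whole-term mentions of each organism in the text."""
    lowered = text.lower()
    counts = Counter()
    for organism, terms in ORGANISM_TERMS.items():
        for term in terms:
            # Lookarounds keep "mice" from matching inside a longer word.
            pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
            counts[organism] += len(re.findall(pattern, lowered))
    return counts


def guess_focus_organism(text):
    """Return the most frequently mentioned organism, or None if none occur."""
    counts = organism_frequencies(text)
    organism, count = counts.most_common(1)[0]
    return organism if count > 0 else None


sample = (
    "Mutant mice lacking the gene were compared with wild-type mice; "
    "a yeast two-hybrid screen identified interaction partners."
)
focus = guess_focus_organism(sample)
```

The sample text also shows why raw frequency alone can mislead: a mouse study that uses a yeast two-hybrid assay mentions yeast without yeast being the focus organism. That kind of ambiguity is one plausible reason the abstract reports gains from combining frequency with other features.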

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3598/3045319/dff2a2fcbcdb/1756-0500-4-32-1.jpg
