Mining protein function from text using term-based support vector machines.

Author information

Rice Simon B, Nenadic Goran, Stapley Benjamin J

Affiliation

Faculty of Life Sciences, University of Manchester, UK.

Publication information

BMC Bioinformatics. 2005;6 Suppl 1(Suppl 1):S22. doi: 10.1186/1471-2105-6-S1-S22. Epub 2005 May 24.

DOI:10.1186/1471-2105-6-S1-S22

PMID:15960835

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC1869015/

Abstract

BACKGROUND

Text mining has spurred huge interest in the domain of biology. The goal of the BioCreAtIvE exercise was to evaluate the performance of current text mining systems. We participated in Task 2, which addressed assigning Gene Ontology terms to human proteins and selecting relevant evidence from full-text documents. We approached it as a modified form of the document classification task. We used a supervised machine-learning approach (based on support vector machines) to assign protein function and select passages that support the assignments. As classification features, we used a protein's co-occurring terms that were automatically extracted from documents.
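
The approach described above (one supervised classifier per GO term, with a protein's co-occurring terms as features) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say which SVM implementation, kernel, or feature weighting was used, so the Pegasos-style linear trainer, the raw term counts, the toy term lists, and all function names here are hypothetical.

```python
from collections import Counter

def featurize(terms, vocab):
    """Turn the terms co-occurring with a protein into a term-count vector."""
    counts = Counter(terms)
    return [float(counts[t]) for t in vocab]

def train_linear_svm(X, y, lam=0.01, epochs=50):
    """Linear SVM (hinge loss + L2 penalty) via Pegasos-style sub-gradient steps.

    Examples are visited in a fixed order instead of being sampled at random,
    so the toy run is reproducible. No bias term is fitted.
    """
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            w = [(1.0 - eta * lam) * wj for wj in w]  # shrink (L2 penalty)
            if margin < 1.0:  # hinge loss is active: move towards the example
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
    return w

def predict(w, x):
    """Return +1 if the GO term is assigned, otherwise -1."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1

# Hypothetical training data (assumption, not from the paper): term lists for
# proteins that do (+1) or do not (-1) carry one particular GO term.
train = [
    (["kinase", "phosphorylation", "atp"], 1),
    (["membrane", "transport", "channel"], -1),
    (["kinase", "atp", "substrate"], 1),
    (["transport", "channel", "ion"], -1),
]
vocab = sorted({t for terms, _ in train for t in terms})
X = [featurize(terms, vocab) for terms, _ in train]
y = [label for _, label in train]
w = train_linear_svm(X, y)

print(predict(w, featurize(["kinase", "phosphorylation"], vocab)))
print(predict(w, featurize(["membrane", "transport"], vocab)))
```

In a realistic setting one such binary classifier would be trained per GO term, and the same scoring could be applied to individual passages to rank candidate evidence text.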

RESULTS

The results, as evaluated by curators, were modest and varied considerably across problems: in many cases GO terms were assigned to proteins relatively well, but the selected supporting text was typically not relevant (precision ranged from 3% to 50%). The method appears to work best when a substantial set of relevant documents is available, and it works poorly on single documents and/or short passages. These initial results suggest that our approach can also mine annotations from text even when an explicit statement relating a protein to a GO term is absent.
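
For reference, precision here is the fraction of submitted predictions that curators judged correct. A small worked example follows; the counts are hypothetical and chosen only to reproduce the two ends of the reported range, they are not taken from the evaluation.

```python
def precision(true_positives, false_positives):
    """Fraction of submitted predictions that were judged correct."""
    submitted = true_positives + false_positives
    if submitted == 0:
        return 0.0  # nothing submitted, so precision is reported as 0 here
    return true_positives / submitted

# Hypothetical counts (assumption): 3 correct out of 100, and 5 correct out of 10.
low = precision(3, 97)
high = precision(5, 5)
print(f"{low:.0%} to {high:.0%}")
```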

CONCLUSION

A machine learning approach to mining protein function predictions from text can yield good performance only if sufficient training data are available and a significant amount of supporting data is used for prediction. The most promising results are for combined document retrieval and GO term assignment, which calls for the integration of methods developed in BioCreAtIvE Task 1 and Task 2.
