Mining protein function from text using term-based support vector machines.

Authors

Rice Simon B, Nenadic Goran, Stapley Benjamin J

Affiliation

Faculty of Life Sciences, University of Manchester, UK.

Publication information

BMC Bioinformatics. 2005;6 Suppl 1(Suppl 1):S22. doi: 10.1186/1471-2105-6-S1-S22. Epub 2005 May 24.

Abstract

BACKGROUND

Text mining has spurred huge interest in the domain of biology. The goal of the BioCreAtIvE exercise was to evaluate the performance of current text mining systems. We participated in Task 2, which addressed assigning Gene Ontology terms to human proteins and selecting relevant evidence from full-text documents. We approached it as a modified form of the document classification task. We used a supervised machine-learning approach (based on support vector machines) to assign protein function and select passages that support the assignments. As classification features, we used a protein's co-occurring terms that were automatically extracted from documents.
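As a rough illustration of the idea described above, the following minimal Python sketch trains a linear support vector machine for a single GO term, using the terms that co-occur with a protein as binary features, and then ranks candidate passages by their SVM score. It is not the authors' implementation: the abstract does not state the kernel, feature weighting, or term extraction method, so the toy terms, labels, hyperparameters, and function names (`train_linear_svm`, `score`, `predict`, `best_passage`) are all assumptions made for illustration.

```python
# Toy sketch of a term-based linear SVM for ONE hypothetical GO term.
# All terms, labels, and hyperparameters below are invented for illustration.

def train_linear_svm(examples, epochs=50, lr=0.1, lam=0.01):
    """Sub-gradient descent on the L2-regularised hinge loss (no bias term).

    `examples` is a list of (set_of_terms, label) pairs with label in {+1, -1}.
    Returns a dict mapping each term to its learned weight.
    """
    weights = {}
    for _ in range(epochs):
        for terms, label in examples:
            margin = label * sum(weights.get(t, 0.0) for t in terms)
            # L2 regularisation: shrink every weight slightly on each step.
            for t in list(weights):
                weights[t] *= 1.0 - lr * lam
            # Hinge loss is active only when the margin is below 1.
            if margin < 1.0:
                for t in terms:
                    weights[t] = weights.get(t, 0.0) + lr * label
    return weights


def score(weights, terms):
    """Linear decision value; terms never seen in training contribute 0."""
    return sum(weights.get(t, 0.0) for t in terms)


def predict(weights, terms):
    """Return +1 if the GO term is predicted to apply, otherwise -1."""
    return 1 if score(weights, terms) > 0 else -1


def best_passage(weights, passages):
    """Index of the passage (a set of terms) with the highest SVM score."""
    return max(range(len(passages)), key=lambda i: score(weights, passages[i]))


# Hypothetical training data: terms co-occurring with proteins that are (+1)
# or are not (-1) annotated with a kinase-related GO term.
training = [
    ({"phosphorylation", "atp", "kinase"}, +1),
    ({"kinase", "signal", "phosphorylation"}, +1),
    ({"membrane", "transport", "channel"}, -1),
    ({"transport", "ion", "channel"}, -1),
]
weights = train_linear_svm(training)

# Hypothetical candidate evidence passages, already reduced to term sets.
passages = [{"membrane", "channel"}, {"kinase", "phosphorylation"}]

print(predict(weights, {"atp", "kinase"}))   # classify an unseen protein
print(best_passage(weights, passages))       # pick the supporting passage
```

In this sketch the same learned weights serve both steps: classifying a protein from its pooled co-occurring terms, and selecting the passage whose terms score highest as supporting evidence. A short passage contributes few terms and so gives the classifier less to work with than a pooled set of documents.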

RESULTS

The results evaluated by curators were modest, and quite variable for different problems: in many cases we have relatively good assignment of GO terms to proteins, but the selected supporting text was typically non-relevant (precision spanning from 3% to 50%). The method appears to work best when a substantial set of relevant documents is obtained, while it works poorly on single documents and/or short passages. The initial results suggest that our approach can also mine annotations from text even when an explicit statement relating a protein to a GO term is absent.

CONCLUSION

A machine learning approach to mining protein function predictions from text can yield good performance only if sufficient training data is available and a significant amount of supporting data is used for prediction. The most promising results are for combined document retrieval and GO term assignment, which calls for the integration of methods developed in BioCreAtIvE Task 1 and Task 2.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dd39/1869015/513190e24817/1471-2105-6-S1-S22-1.jpg
