Substring selection for biomedical document classification.

Author information

Han Bo, Obradovic Zoran, Hu Zhang-Zhi, Wu Cathy H, Vucetic Slobodan

Affiliation

Center for Information Science and Technology, Temple University, Philadelphia, PA 19122, USA.

Publication information

Bioinformatics. 2006 Sep 1;22(17):2136-42. doi: 10.1093/bioinformatics/btl350. Epub 2006 Jul 12.

DOI:10.1093/bioinformatics/btl350

PMID:16837530

Abstract

MOTIVATION

Attribute selection is a critical step in the development of document classification systems. As a standard practice, words are stemmed and the most informative stems are used as attributes in classification. Owing to the high complexity of biomedical terminology, general-purpose stemming algorithms are often conservative and can also remove informative stems. This can reduce accuracy, especially when the number of labeled documents is small. To address this issue, we propose an algorithm that omits stemming and instead uses the most discriminative substrings as attributes.
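The idea of using discriminative substrings as attributes can be illustrated with a minimal sketch. The abstract does not state how substrings are enumerated or scored, so everything here is an assumption made for illustration: the substring length bounds, the use of information gain as the discriminative score, and the function names `substrings`, `information_gain`, and `select_substrings`. It is not the authors' algorithm.

```python
import math
from collections import Counter


def substrings(text, min_len=3, max_len=6):
    """Return the distinct character substrings of every word in `text`.

    The length bounds are illustrative assumptions, not values from the paper.
    """
    found = set()
    for word in text.lower().split():
        for n in range(min_len, max_len + 1):
            for i in range(len(word) - n + 1):
                found.add(word[i:i + n])
    return found


def entropy(pos, neg):
    """Binary entropy (in bits) of a set with `pos` positive and `neg` negative documents."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(pos_with, neg_with, n_pos, n_neg):
    """Reduction in label entropy from splitting documents on presence of a substring."""
    n = n_pos + n_neg
    n_with = pos_with + neg_with
    n_without = n - n_with
    after = (n_with / n) * entropy(pos_with, neg_with)
    after += (n_without / n) * entropy(n_pos - pos_with, n_neg - neg_with)
    return entropy(n_pos, n_neg) - after


def select_substrings(docs, labels, k=10, min_len=3, max_len=6):
    """Return the k highest-scoring (score, substring) pairs.

    Information gain is an assumed scoring criterion chosen for this sketch.
    """
    pos_counts, neg_counts = Counter(), Counter()
    for doc, label in zip(docs, labels):
        target = pos_counts if label == 1 else neg_counts
        # Count each substring once per document (document frequency).
        target.update(substrings(doc, min_len, max_len))
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    scored = [
        (information_gain(pos_counts[s], neg_counts[s], n_pos, n_neg), s)
        for s in set(pos_counts) | set(neg_counts)
    ]
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]


docs = [
    "phosphorylation of serine residues",
    "phosphorylated tyrosine kinase",
    "gene expression in yeast",
    "protein structure analysis",
]
labels = [1, 1, 0, 0]
top = select_substrings(docs, labels, k=5)
```

In this toy corpus the substring `phosph` occurs in both positive documents and in neither negative one, so it separates the classes without any stemming step. The selected substrings would then serve as binary attributes for a classifier.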

RESULTS

The approach was tested on five annotated sets of abstracts from iProLINK that report experimental evidence for five types of protein post-translational modifications. The experiments showed that Naive Bayes and support vector machine classifiers perform consistently better [area under the ROC curve (AUC) in the range 0.92-0.97] when using the proposed attribute selection than when using attributes obtained by the Porter stemmer algorithm (AUC in the range 0.86-0.93). The proposed approach is particularly useful when labeled datasets are small.
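For readers unfamiliar with the metric, AUC equals the probability that a randomly chosen positive document receives a higher classifier score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition; the function name `auc` and the toy scores are illustrative assumptions and are unrelated to the paper's data.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ordered correctly.
example = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

The pairwise form is quadratic in the number of documents, which is fine for small labeled sets; a rank-based computation gives the same value more efficiently.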

Similar articles

Protein annotation by EBIMed.

Nat Biotechnol. 2006 Aug;24(8):902-3. doi: 10.1038/nbt0806-902.

Recognizing names in biomedical texts: a machine learning approach.

Bioinformatics. 2004 May 1;20(7):1178-90. doi: 10.1093/bioinformatics/bth060. Epub 2004 Feb 10.

Gene symbol disambiguation using knowledge-based profiles.

Bioinformatics. 2007 Apr 15;23(8):1015-22. doi: 10.1093/bioinformatics/btm056. Epub 2007 Feb 21.

Literature mining and database annotation of protein phosphorylation using a rule-based system.

Bioinformatics. 2005 Jun 1;21(11):2759-65. doi: 10.1093/bioinformatics/bti390. Epub 2005 Apr 6.

MedPost: a part-of-speech tagger for bioMedical text.

Bioinformatics. 2004 Sep 22;20(14):2320-1. doi: 10.1093/bioinformatics/bth227. Epub 2004 Apr 8.

GeneInfoMiner--a web server for exploring biomedical literature using batch sequence ID.

Bioinformatics. 2005 Aug 15;21(16):3452-3. doi: 10.1093/bioinformatics/bti559. Epub 2005 Jun 30.

Automatic assignment of biomedical categories: toward a generic approach.

Bioinformatics. 2006 Mar 15;22(6):658-64. doi: 10.1093/bioinformatics/bti783. Epub 2005 Nov 15.

Using argumentation to extract key sentences from biomedical abstracts.

Int J Med Inform. 2007 Feb-Mar;76(2-3):195-200. doi: 10.1016/j.ijmedinf.2006.05.002. Epub 2006 Jul 11.

Corpus annotation for mining biomedical events from literature.

BMC Bioinformatics. 2008 Jan 8;9:10. doi: 10.1186/1471-2105-9-10.

Cited by

Utilizing image and caption information for biomedical document classification.

Bioinformatics. 2021 Jul 12;37(Suppl_1):i468-i476. doi: 10.1093/bioinformatics/btab331.

Phylogenetic and biological significance of evolutionary elements from metazoan mitochondrial genomes.

PLoS One. 2014 Jan 20;9(1):e84330. doi: 10.1371/journal.pone.0084330. eCollection 2014.

Enhancing navigation in biomedical databases by community voting and database-driven text classification.

BMC Bioinformatics. 2009 Oct 3;10:317. doi: 10.1186/1471-2105-10-317.

The first step in the development of Text Mining technology for Cancer Risk Assessment: identifying and organizing scientific evidence in risk assessment literature.

BMC Bioinformatics. 2009 Sep 22;10:303. doi: 10.1186/1471-2105-10-303.

GAPscreener: an automatic tool for screening human genetic association literature in PubMed using the support vector machine technique.

BMC Bioinformatics. 2008 Apr 22;9:205. doi: 10.1186/1471-2105-9-205.

Exploiting and integrating rich features for biological literature classification.

BMC Bioinformatics. 2008 Apr 11;9 Suppl 3(Suppl 3):S4. doi: 10.1186/1471-2105-9-S3-S4.

Automating document classification for the Immune Epitope Database.

BMC Bioinformatics. 2007 Jul 26;8:269. doi: 10.1186/1471-2105-8-269.
