Han Bo, Obradovic Zoran, Hu Zhang-Zhi, Wu Cathy H, Vucetic Slobodan
Center for Information Science and Technology, Temple University, Philadelphia, PA 19122, USA.
Bioinformatics. 2006 Sep 1;22(17):2136-42. doi: 10.1093/bioinformatics/btl350. Epub 2006 Jul 12.
Attribute selection is a critical step in the development of document classification systems. As a standard practice, words are stemmed and the most informative stems are used as attributes in classification. Owing to the high complexity of biomedical terminology, general-purpose stemming algorithms are often conservative and can also remove informative stems. This can reduce accuracy, especially when the number of labeled documents is small. To address this issue, we propose an algorithm that omits stemming and, instead, uses the most discriminative substrings as attributes.
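The idea of using discriminative substrings as attributes can be illustrated with a minimal sketch. The abstract does not state how substrings are scored or which lengths are considered, so everything below is an assumption made for illustration: character substrings of length 3 to 6, scored by the information gain of the binary attribute "substring occurs in the document". The function names (`substrings`, `information_gain`, `select_substrings`) are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log


def substrings(text, min_len=3, max_len=6):
    """Return the distinct character substrings of text within the length bounds."""
    text = text.lower()
    found = set()
    for n in range(min_len, max_len + 1):
        for i in range(len(text) - n + 1):
            found.add(text[i:i + n])
    return found


def entropy(pos, neg):
    """Binary entropy (in bits) of a group with pos positive and neg negative items."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * log(p, 2)
    return h


def information_gain(n_pos, n_neg, df_pos, df_neg):
    """Information gain of the attribute 'substring is present' for the class label.

    n_pos, n_neg: number of positive / negative documents.
    df_pos, df_neg: number of positive / negative documents containing the substring.
    """
    n = n_pos + n_neg
    present = df_pos + df_neg
    absent = n - present
    conditional = 0.0
    if present:
        conditional += present / n * entropy(df_pos, df_neg)
    if absent:
        conditional += absent / n * entropy(n_pos - df_pos, n_neg - df_neg)
    return entropy(n_pos, n_neg) - conditional


def select_substrings(docs, labels, k=10, min_len=3, max_len=6):
    """Return the k substrings with the highest information gain (assumed criterion)."""
    df_pos, df_neg = Counter(), Counter()
    for doc, label in zip(docs, labels):
        (df_pos if label else df_neg).update(substrings(doc, min_len, max_len))
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    candidates = set(df_pos) | set(df_neg)
    scores = {s: information_gain(n_pos, n_neg, df_pos[s], df_neg[s]) for s in candidates}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(candidates, key=lambda s: (-scores[s], s))[:k]


# Toy data, invented for illustration only.
docs = [
    "phosphorylation of serine residues",
    "phosphorylated tyrosine was detected",
    "gene expression in yeast",
    "cell growth and division",
]
labels = [1, 1, 0, 0]
print(select_substrings(docs, labels, k=5))
```

In this toy example a substring such as "phos" is shared by "phosphorylation" and "phosphorylated" without any stemming step, which is the kind of attribute the approach is meant to capture. The selected substrings would then serve as binary or count attributes for a classifier.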
The approach was tested on five annotated sets of abstracts from iProLINK that report experimental evidence about five types of protein post-translational modifications. The experiments showed that Naive Bayes and support vector machine classifiers perform consistently better with the proposed attribute selection [area under the ROC curve (AUC) in the range 0.92-0.97] than with attributes obtained by the Porter stemmer algorithm (AUC in the range 0.86-0.93). The proposed approach is particularly useful when labeled datasets are small.
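For readers unfamiliar with the metric reported above, the following sketch shows one common way to compute AUC: as the fraction of positive-negative document pairs in which the positive document receives the higher classifier score, with ties counted as one half. The function name `auc` and the toy scores are illustrative assumptions and do not reproduce the evaluation code or data of the study.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Invented scores for four documents: two positive, two negative.
print(auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

The pairwise form is quadratic in the number of documents, which is acceptable for small labeled sets like those discussed here; a rank-based computation would be preferable for large collections.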