Cohen Aaron M
Oregon Health & Science University, Portland, OR, USA.
AMIA Annu Symp Proc. 2006;2006:161-5.
Automated document classification can be a valuable tool for biomedical tasks that involve large amounts of text. In biomedicine, however, documents with the desired properties are often rare, and special methods are usually required to address this imbalance. We propose and evaluate a method for classifying biomedical text documents that optimizes for utility when misclassification costs are highly asymmetric between the positive and negative classes. The method applies chi-square feature selection, then runs several iterations of cost-proportionate rejection sampling, each followed by training a support vector machine (SVM), and combines the resulting classifiers' predictions by voting. It is straightforward and fast, and it achieves competitive performance on a set of standardized biomedical text classification evaluation tasks. The method is a good general-purpose approach for classifying biomedical text.
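The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the number of rounds, the cost values, and the `fit` callback are all assumptions made for illustration, and the base learner is left pluggable (the abstract uses an SVM there). The sketch shows the chi-square score for one binary term, cost-proportionate rejection sampling (each example is kept with probability equal to its cost divided by the largest cost), and majority voting over the classifiers trained on the resampled sets.

```python
import random
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.

    n11: positive docs containing the term, n10: negative docs containing it,
    n01: positive docs without the term,   n00: negative docs without it.
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0  # term or class is constant, so it carries no signal
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def rejection_sample(examples, costs, rng):
    """Keep each example with probability cost / max(costs).

    Examples carrying the highest misclassification cost are always kept;
    cheaper ones are thinned out in proportion to their cost.
    """
    top = max(costs)
    return [ex for ex, c in zip(examples, costs) if rng.random() < c / top]


def train_ensemble(examples, costs, fit, rounds=5, seed=0):
    """Train one model per resampled set.

    `fit` is a hypothetical callback: it takes a list of examples and
    returns a callable model (an SVM trainer would slot in here).
    """
    rng = random.Random(seed)
    return [fit(rejection_sample(examples, costs, rng)) for _ in range(rounds)]


def vote(models, x):
    """Return the label predicted by the most models."""
    tally = Counter(model(x) for model in models)
    return tally.most_common(1)[0][0]
```

Under these assumptions, rare positive documents would be given a higher cost than negatives, so every resampled training set keeps all positives while drawing a different subset of negatives. An odd number of rounds avoids ties in the vote.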