Rios Anthony, Kavuluru Ramakanth
Department of Computer Science, University of Kentucky, Lexington, Kentucky.
Division of Biomedical Informatics, Depts. of Biostatistics and Computer Science, University of Kentucky, Lexington, Kentucky.
ACM BCB. 2015 Sep;2015:258-267. doi: 10.1145/2808719.2808746.
Building high-accuracy text classifiers is an important task in biomedicine, given the wealth of information hidden in unstructured narratives such as research articles and clinical documents. Because feature spaces are large, text classification has traditionally relied on discriminative approaches such as logistic regression and support vector machines with n-gram and semantic features (e.g., named entities), with additional performance gains typically coming from feature selection and ensemble approaches. In this paper, we demonstrate that a more direct approach using convolutional neural networks (CNNs) outperforms several traditional approaches in biomedical text classification, with the specific use case of assigning medical subject headings (MeSH terms) to biomedical articles. Trained annotators at the National Library of Medicine (NLM) assign on average 13 codes to each biomedical article, thus semantically indexing scientific literature to support NLM's PubMed search system. Recent evidence suggests that effective automated efforts for MeSH term assignment start with binary classifiers for each term. We use CNNs to build binary text classifiers and achieve an absolute improvement of over 3% in macro F-score on a set of selected hard-to-classify MeSH terms, compared with the best prior results on a public dataset. Additional experiments on 50 high-frequency terms in the dataset also show improvements with CNNs. Our results indicate the strong potential of CNNs in biomedical text classification tasks.
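The macro F-score reported above averages a per-term score over the binary classifiers, so every MeSH term counts equally regardless of its frequency. The following is a minimal sketch of that computation, not the authors' evaluation code. It assumes the macro score is the unweighted mean of per-term F1 values; the function names, term names, and labels are hypothetical toy values, not data from the paper.

```python
from typing import Dict, List, Tuple


def binary_f1(gold: List[int], pred: List[int]) -> float:
    """F1 for one binary classifier (1 = term assigned, 0 = not assigned)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        # No true positives: precision or recall is zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_term: Dict[str, Tuple[List[int], List[int]]]) -> float:
    """Unweighted mean of per-term F1 scores (assumed macro definition)."""
    scores = [binary_f1(gold, pred) for gold, pred in per_term.values()]
    return sum(scores) / len(scores)


# Hypothetical toy labels for two terms over four articles.
toy_results = {
    "TermA": ([1, 1, 0, 0], [1, 0, 0, 0]),  # one hit, one miss
    "TermB": ([1, 0, 1, 0], [1, 0, 1, 1]),  # two hits, one false alarm
}

if __name__ == "__main__":
    for term, (gold, pred) in toy_results.items():
        print(term, round(binary_f1(gold, pred), 4))
    print("macro F1:", round(macro_f1(toy_results), 4))
```

Because each term contributes one score to the mean, an improvement on rare, hard-to-classify terms moves the macro F-score as much as the same improvement on frequent ones.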