Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA.
The Jackson Laboratory, 600 Main St., Bar Harbor, ME, USA.
Database (Oxford). 2019 Jan 1;2019. doi: 10.1093/database/baz045.
Published literature is an important source of knowledge supporting biomedical research. Given the large and increasing number of publications, automated document classification plays an important role in biomedical research. Effective biomedical document classifiers are especially needed for bio-databases, in which the information stems from many thousands of biomedical publications that curators must read in detail and annotate. In addition, biomedical document classification often amounts to identifying a small subset of relevant publications within a much larger collection of available documents. As such, addressing class imbalance is essential to a practical classifier. We present here an effective classification scheme for automatically identifying papers among a large pool of biomedical publications that contain information relevant to a specific topic, which the curators are interested in annotating. The proposed scheme is based on a meta-classification framework using cluster-based under-sampling combined with named-entity recognition and statistical feature selection strategies. We examined the performance of our method over a large imbalanced data set that was originally manually curated by the Jackson Laboratory's Gene Expression Database (GXD). The set consists of more than 90 000 PubMed abstracts, of which about 13 000 documents are labeled as relevant to GXD while the others are not relevant. Our results, 0.72 precision, 0.80 recall and 0.75 f-measure, demonstrate that our proposed classification scheme effectively categorizes such a large data set in the face of data imbalance.
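The cluster-based under-sampling idea named in the abstract can be illustrated with a minimal sketch. The abstract does not say which clustering algorithm, base classifier, or meta-classifier is used, so everything here is an assumption made for illustration: k-means splits the majority (irrelevant) class, a nearest-centroid rule serves as the base classifier, and a unanimous-vote rule stands in for a learned meta-classifier. All function names and the toy 2-D points are hypothetical, and the named-entity recognition and feature selection steps are omitted.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means (an assumed choice); returns the non-empty clusters."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return [c for c in clusters if c]

def train_base(relevant, irrelevant):
    """Nearest-centroid base classifier (assumed): store one centroid per class."""
    return mean(relevant), mean(irrelevant)

def predict_base(model, x):
    """Return 1 (relevant) if x is closer to the relevant centroid, else 0."""
    pos, neg = model
    return 1 if sq_dist(x, pos) < sq_dist(x, neg) else 0

def train_ensemble(relevant, irrelevant, k):
    """Under-sample by clustering: pair the full minority class with each
    majority-class cluster, so every base classifier sees a less skewed set."""
    return [train_base(relevant, cluster) for cluster in kmeans(irrelevant, k)]

def predict_ensemble(models, x):
    """Stand-in for a meta-classifier: relevant only if every base model agrees."""
    return 1 if all(predict_base(m, x) for m in models) else 0

# Toy data with a 1:4 imbalance: 4 relevant points, 16 irrelevant points.
relevant = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5)]
irrelevant = ([(10.0 + dx, 10.0 + dy) for dx in (0, 1) for dy in (0, 1, 2, 3)]
              + [(-10.0 - dx, 10.0 + dy) for dx in (0, 1) for dy in (0, 1, 2, 3)])

models = train_ensemble(relevant, irrelevant, k=2)
queries = [(0.2, 0.2), (10.0, 11.0), (-10.0, 11.0)]
predictions = [predict_ensemble(models, x) for x in queries]
```

In this toy setup each base classifier is trained on 4 relevant and 8 irrelevant points instead of 4 and 16. A real system would replace the unanimous-vote rule with a classifier trained on the base classifiers' outputs and would operate on text features rather than 2-D points.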