Laza Rosalía, Pavón Reyes, Reboiro-Jato Miguel, Fdez-Riverola Florentino
ESEI, Escuela Superior de Ingeniería Informática, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas s/n, 32004, Ourense, Spain.
J Integr Bioinform. 2011 Sep 16;8(3):177. doi: 10.2390/biecoll-jib-2011-177.
Document classification has become an active research field, partly because of the growing volume of biomedical information in digital form that must be catalogued and organized. In this context, machine learning techniques are usually applied to text classification through a general inductive process that automatically builds a text classifier from a set of pre-classified documents. In this domain, imbalanced data is a well-known problem in many practical applications of knowledge discovery, and it markedly affects the performance of standard classifiers. In this paper, we investigate the application of a Bayesian Network (BN) model to the triage of documents, which are represented by the association of different MeSH terms. Our results show that BNs are adequate for describing conditional independencies between MeSH terms and that the MeSH ontology is a valuable resource for representing Medline documents at different abstraction levels. Moreover, we perform an extensive experimental evaluation to investigate whether classifying Medline documents with a BN classifier poses additional challenges under class-imbalanced prediction. The evaluation involves two balancing methods: under-sampling and cost-sensitive learning. We conclude that the BN classifier is sensitive to both balancing strategies and that existing techniques can improve its overall performance.
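The abstract names the two balancing methods but does not give their procedures. As a rough illustration only, the following Python sketch shows one common form of each: random under-sampling of the majority class, and a cost-sensitive decision rule applied to a classifier's posterior probability. The function names, the cost values, the toy documents, and the use of MeSH term sets as features are all assumptions made for this example, not details taken from the paper.

```python
import random
from collections import Counter


def undersample(docs, labels, seed=0):
    """Randomly discard documents so every class keeps as many
    examples as the smallest class (random under-sampling)."""
    rng = random.Random(seed)
    by_class = {}
    for doc, label in zip(docs, labels):
        by_class.setdefault(label, []).append(doc)
    n_min = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        for doc in rng.sample(items, n_min):
            balanced.append((doc, label))
    rng.shuffle(balanced)
    return balanced


def cost_sensitive_decision(p_relevant, cost_fn=5.0, cost_fp=1.0):
    """Choose the label with the lower expected cost, given the
    posterior P(relevant | document). The costs are illustrative."""
    cost_if_rejected = p_relevant * cost_fn          # a relevant document is missed
    cost_if_accepted = (1.0 - p_relevant) * cost_fp  # an irrelevant document is kept
    return "relevant" if cost_if_accepted <= cost_if_rejected else "irrelevant"


# Toy corpus (hypothetical): each document is a set of MeSH terms.
docs = [
    {"Neoplasms", "Humans"}, {"Neoplasms", "Mice"},
    {"Humans"}, {"Mice"}, {"Animals"}, {"Humans", "Animals"},
    {"Mice", "Animals"}, {"Humans", "Mice"},
]
labels = ["relevant"] * 2 + ["irrelevant"] * 6

balanced = undersample(docs, labels)
class_counts = Counter(label for _, label in balanced)
```

With these assumed costs, missing a relevant document is five times as expensive as keeping an irrelevant one, so a document with posterior 0.2 is still labelled relevant, whereas equal costs would reject it.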