
Evaluating the effect of unbalanced data in biomedical document classification.

Authors

Laza Rosalía, Pavón Reyes, Reboiro-Jato Miguel, Fdez-Riverola Florentino

Affiliation

ESEI, Escuela Superior de Ingeniería Informática, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas s/n, 32004, Ourense, Spain.

Publication details

J Integr Bioinform. 2011 Sep 16;8(3):177. doi: 10.2390/biecoll-jib-2011-177.

DOI:10.2390/biecoll-jib-2011-177
PMID:21926440
Abstract

Nowadays, document classification has become an interesting research field. Partly, this is due to the increasing availability of biomedical information in digital form which is necessary to catalogue and organize. In this context, machine learning techniques are usually applied to text classification by using a general inductive process that automatically builds a text classifier from a set of pre-classified documents. Related with this domain, imbalanced data is a well-known problem in many practical applications of knowledge discovery and its effects on the performance of standard classifiers are remarkable. In this paper, we investigate the application of a Bayesian Network (BN) model for the triage of documents, which are represented by the association of different MeSH terms. Our results show that BNs are adequate for describing conditional independencies between MeSH terms and that MeSH ontology is a valuable resource for representing Medline documents at different abstraction levels. Moreover, we perform an extensive experimental evaluation to investigate if the classification of Medline documents using a BN classifier poses additional challenges when dealing with class-imbalanced prediction. The evaluation involves two methods, under-sampling and cost-sensitive learning. We conclude that BN classifier is sensitive to both balancing strategies and existing techniques can improve its overall performance.
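The abstract names two balancing strategies, under-sampling and cost-sensitive learning, without giving their details here. The following is a minimal, generic sketch of both ideas, not the authors' implementation. It assumes a binary triage task with the hypothetical labels "relevant" and "irrelevant", documents stored as sets of MeSH terms, and illustrative cost values; the function names `undersample` and `cost_sensitive_label` are invented for this example.

```python
import random
from collections import Counter


def undersample(docs, labels, seed=0):
    """Randomly drop examples so every class keeps as many as the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for doc, label in zip(docs, labels):
        by_class.setdefault(label, []).append(doc)
    n_min = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        # Sample without replacement down to the minority-class size.
        for doc in rng.sample(items, n_min):
            balanced.append((doc, label))
    rng.shuffle(balanced)
    return balanced


def cost_sensitive_label(p_relevant, cost_fn=5.0, cost_fp=1.0):
    """Choose the label with the lower expected misclassification cost.

    p_relevant: a classifier's posterior probability that the document is relevant.
    cost_fn: assumed cost of missing a relevant document (false negative).
    cost_fp: assumed cost of flagging an irrelevant document (false positive).
    """
    cost_if_predict_irrelevant = p_relevant * cost_fn
    cost_if_predict_relevant = (1.0 - p_relevant) * cost_fp
    if cost_if_predict_relevant <= cost_if_predict_irrelevant:
        return "relevant"
    return "irrelevant"


# Hypothetical toy corpus: 2 relevant and 8 irrelevant documents,
# each represented by a set of MeSH terms (illustrative values only).
docs = [{"Humans", f"Term{i}"} for i in range(10)]
labels = ["relevant"] * 2 + ["irrelevant"] * 8

balanced = undersample(docs, labels, seed=1)
print(Counter(label for _, label in balanced))
print(cost_sensitive_label(0.2), cost_sensitive_label(0.2, cost_fn=1.0))
```

Under-sampling changes the training data before a classifier is fitted, whereas the cost-sensitive rule leaves the data alone and shifts the decision threshold: with the assumed 5:1 cost ratio, a document is flagged as relevant once its posterior probability reaches 1/6 rather than 1/2.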

Similar articles

1
Evaluating the effect of unbalanced data in biomedical document classification.
J Integr Bioinform. 2011 Sep 16;8(3):177. doi: 10.2390/biecoll-jib-2011-177.
2
Improving imbalanced scientific text classification using sampling strategies and dictionaries.
J Integr Bioinform. 2011 Sep 15;8(3):176. doi: 10.2390/biecoll-jib-2011-176.
3
Improving MeSH classification of biomedical articles using citation contexts.
J Biomed Inform. 2011 Oct;44(5):881-96. doi: 10.1016/j.jbi.2011.05.007. Epub 2011 Jun 12.
4
Text mining for traditional Chinese medical knowledge discovery: a survey.
J Biomed Inform. 2010 Aug;43(4):650-60. doi: 10.1016/j.jbi.2010.01.002. Epub 2010 Jan 13.
5
Incorporating expert knowledge when learning Bayesian network structure: a medical case study.
Artif Intell Med. 2011 Nov;53(3):181-204. doi: 10.1016/j.artmed.2011.08.004. Epub 2011 Sep 29.
6
Classification of patients by severity grades during triage in the emergency department using data mining methods.
J Eval Clin Pract. 2012 Apr;18(2):378-88. doi: 10.1111/j.1365-2753.2010.01592.x. Epub 2010 Dec 19.
7
Mixture classification model based on clinical markers for breast cancer prognosis.
Artif Intell Med. 2010 Feb-Mar;48(2-3):129-37. doi: 10.1016/j.artmed.2009.07.008. Epub 2009 Dec 14.
8
Large scale biomedical texts classification: a kNN and an ESA-based approaches.
J Biomed Semantics. 2016 Jun 16;7:40. doi: 10.1186/s13326-016-0073-1.
9
Classification methods for finding articles describing protein-protein interactions in PubMed.
J Integr Bioinform. 2011 Sep 16;8(3):178. doi: 10.2390/biecoll-jib-2011-178.
10
An ensemble heterogeneous classification methodology for discovering health-related knowledge in social media messages.
J Biomed Inform. 2014 Jun;49:255-68. doi: 10.1016/j.jbi.2014.03.005. Epub 2014 Mar 16.

Cited by

1
DMLS: an automated pipeline to extract the Drosophila modular transcription regulators and targets from massive literature articles.
Database (Oxford). 2024 Jun 20;2024:0. doi: 10.1093/database/baae049.
2
YTLR: Extracting yeast transcription factor-gene associations from the literature using automated literature readers.
Comput Struct Biotechnol J. 2022 Aug 24;20:4636-4644. doi: 10.1016/j.csbj.2022.08.041. eCollection 2022.
3
Data analytics and clinical feature ranking of medical records of patients with sepsis.
BioData Min. 2021 Feb 3;14(1):12. doi: 10.1186/s13040-021-00235-0.
4
Short-term rainfall forecast model based on the improved BP-NN algorithm.
Sci Rep. 2019 Dec 24;9(1):19751. doi: 10.1038/s41598-019-56452-5.
5
Screening PubMed abstracts: is class imbalance always a challenge to machine learning?
Syst Rev. 2019 Dec 6;8(1):317. doi: 10.1186/s13643-019-1245-8.