Text classification performance: is the sample size the only factor to be considered?

Author information

Figueroa Rosa L, Zeng-Treitler Qing

Affiliation

Departamento de Ingeniería Eléctrica, Facultad de Ingeniería, Universidad de Concepción, Chile.

Publication information

Stud Health Technol Inform. 2013;192:1193.

PMID:23920967

Abstract

The use of text mining and supervised machine learning algorithms on biomedical databases has become increasingly common. However, a question remains: how much data must be annotated to create a suitable training set for a machine learning classifier? In prior research with active learning in medical text classification, we found evidence that not only sample size but also some intrinsic characteristics of the texts being analyzed, such as the size of the vocabulary and the length of a document, may influence the resulting classifier's performance. This study is an attempt to create a regression model to predict performance based on sample size and other text features. While the model needs to be trained on existing datasets, we believe it is feasible to predict performance without obtaining annotations from new datasets once the model is built.
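The abstract does not specify the form of the regression model, so the following is only a minimal sketch of the general idea: fit a regression on datasets whose classifier performance is already known, then predict performance for a new dataset from its sample size and text features alone. The linear form, the log-transformed features, the helper names (`features`, `fit_ols`, `predict`), and every number below are assumptions made for illustration; the data is synthetic and is not taken from the study.

```python
import math

def features(n_samples, vocab_size, mean_doc_len):
    """Hypothetical feature vector: log10 of sample size, vocabulary size, document length."""
    return (math.log10(n_samples), math.log10(vocab_size), math.log10(mean_doc_len))

def predict(coefs, row):
    """Linear prediction: intercept plus weighted sum of the features."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))

def fit_ols(rows, targets):
    """Ordinary least squares with an intercept, solved via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented system [X^T X | X^T y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Synthetic "existing datasets": (sample size, vocabulary size, mean document length).
# These values are made up purely to exercise the code.
DATASETS = [
    (100, 2000, 300), (200, 5000, 50), (400, 1500, 120),
    (800, 8000, 30), (1600, 3000, 200), (3200, 12000, 80),
]

# Assumed ground-truth relationship used to generate the synthetic performance scores.
TRUE_COEFS = [0.5, 0.1, -0.03, 0.02]

rows = [features(*d) for d in DATASETS]
scores = [predict(TRUE_COEFS, r) for r in rows]  # stand-in for measured performance

coefs = fit_ols(rows, scores)

# Predict performance for a new, unannotated dataset from its text features alone.
new_row = features(500, 4000, 100)
print("predicted performance:", round(predict(coefs, new_row), 4))
```

Because the synthetic scores are generated from a noise-free linear rule, the fit simply recovers that rule. With real measurements the fit would be approximate, and the choice of features and functional form would need to be validated on held-out datasets.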
