Text classification performance: is the sample size the only factor to be considered?

Authors

Figueroa Rosa L, Zeng-Treitler Qing

Affiliation

Departamento de Ingeniería Eléctrica, Facultad de Ingeniería, Universidad de Concepción, Chile.

Publication

Stud Health Technol Inform. 2013;192:1193.

Abstract

The use of text mining and supervised machine learning algorithms on biomedical databases has become increasingly common. However, a question remains: how much data must be annotated to create a suitable training set for a machine learning classifier? In prior research on active learning in medical text classification, we found evidence that not only sample size but also intrinsic characteristics of the texts being analyzed, such as vocabulary size and document length, may influence the resulting classifier's performance. This study attempts to build a regression model that predicts performance from sample size and other text features. Although the model must be trained on existing datasets, we believe that, once it is built, it can feasibly predict performance on new datasets without first obtaining annotations for them.
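
The idea of regressing classifier performance on sample size and text features can be illustrated with a minimal sketch. This is not the authors' model: the abstract does not specify the regression form, so ordinary least squares on log-scaled features is assumed here, and the function names (`describe`, `fit_performance_model`, `predict_performance`), the coefficients, and all data points are hypothetical and synthetic.

```python
import numpy as np

def describe(n_docs, vocab_size, mean_doc_len):
    """Hypothetical feature vector for a dataset: log10 of sample size,
    vocabulary size, and mean document length (an assumed encoding)."""
    return [np.log10(n_docs), np.log10(vocab_size), np.log10(mean_doc_len)]

def fit_performance_model(features, performance):
    """Ordinary least squares: performance ~ intercept + features @ weights."""
    X = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(X, performance, rcond=None)
    return coef

def predict_performance(coef, features):
    """Apply fitted coefficients to the feature rows of new datasets."""
    X = np.column_stack([np.ones(len(features)), features])
    return X @ coef

# Synthetic toy data generated from a made-up linear rule, NOT from the study.
true_coef = np.array([0.40, 0.10, -0.03, 0.02])
train = np.array([describe(n, v, l) for n, v, l in [
    (100, 2000, 300), (200, 5000, 40), (400, 3000, 150),
    (800, 8000, 60), (1600, 4000, 500), (3200, 10000, 90),
]])
observed = np.column_stack([np.ones(len(train)), train]) @ true_coef

# Fit on "existing" datasets, then predict for a new, unannotated one
# using only features that can be computed without labels.
coef = fit_performance_model(train, observed)
new_dataset = np.array([describe(500, 6000, 100)])
pred = predict_performance(coef, new_dataset)[0]
print(f"predicted performance for the new dataset: {pred:.3f}")
```

The point of the sketch is that every input to `predict_performance` can be computed from raw, unlabeled text plus a planned sample size, which is what would make prediction without new annotations possible once such a model has been fitted.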
