Figueroa Rosa L, Zeng-Treitler Qing
Departamento de Ingeniería Eléctrica, Facultad de Ingeniería, Universidad de Concepción, Chile.
Stud Health Technol Inform. 2013;192:1193.
The use of text mining and supervised machine learning algorithms on biomedical databases has become increasingly common. However, a question remains: how much data must be annotated to create a suitable training set for a machine learning classifier? In prior research on active learning for medical text classification, we found evidence that the resulting classifier's performance may be influenced not only by sample size but also by intrinsic characteristics of the texts being analyzed, such as the size of the vocabulary and the length of a document. This study is an attempt to create a regression model that predicts performance from sample size and other text features. Although the model must be trained on existing datasets, we believe that, once it is built, it is feasible to predict performance on new datasets without first obtaining annotations for them.
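The idea of regressing classifier performance on sample size and text features can be sketched as follows. This is a minimal illustration only, not the authors' model: the abstract does not state the functional form, so the ordinary least squares fit on log-scaled predictors is an assumption, the function names (design_matrix, fit_performance_model, predict_performance) are hypothetical, and the data are synthetic placeholders generated from arbitrary coefficients rather than results from any study.

```python
import numpy as np


def design_matrix(sample_size, vocab_size, mean_doc_len):
    """Stack an intercept column with log-scaled predictors (assumed form)."""
    n = np.asarray(sample_size, dtype=float)
    v = np.asarray(vocab_size, dtype=float)
    d = np.asarray(mean_doc_len, dtype=float)
    return np.column_stack([np.ones_like(n), np.log(n), np.log(v), np.log(d)])


def fit_performance_model(X, y):
    """Ordinary least squares fit; returns the coefficient vector."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def predict_performance(coef, X):
    """Predict performance and clip it to the valid [0, 1] range."""
    return np.clip(X @ coef, 0.0, 1.0)


# Synthetic placeholder data standing in for (dataset, sample size) pairs
# whose classifier performance has already been measured.
rng = np.random.default_rng(0)
sample_size = rng.integers(50, 2000, size=200)
vocab_size = rng.integers(500, 20000, size=200)
mean_doc_len = rng.integers(20, 400, size=200)

# Arbitrary coefficients used only to generate the illustration data.
true_coef = np.array([0.30, 0.08, -0.02, 0.01])
X = design_matrix(sample_size, vocab_size, mean_doc_len)
y = X @ true_coef + rng.normal(0.0, 0.01, size=200)

coef = fit_performance_model(X, y)

# Predict for a new, unannotated dataset described only by its text features.
new_X = design_matrix([500], [8000], [120])
predicted = predict_performance(coef, new_X)
print(predicted)
```

Under these assumptions, the final call needs only features that can be computed from raw text (vocabulary size, mean document length) plus a candidate sample size, which is what would allow a performance estimate before any annotation is collected.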