Mac Namee B, Cunningham P, Byrne S, Corrigan O I
Department of Computer Science, Trinity College, Dublin 2, Ireland.
Artif Intell Med. 2002 Jan;24(1):51-70. doi: 10.1016/s0933-3657(01)00092-6.
This paper describes a bias problem encountered in a machine learning approach to outcome prediction in anticoagulant drug therapy. The outcome to be predicted is a measure of the patient's clotting time; this measure is continuous, so the prediction task is a regression problem. Artificial neural networks (ANNs) are a powerful mechanism for learning to predict such outcomes from training data. However, experiments have shown that an ANN is biased towards values that occur more commonly in the training data and is thus less likely to be correct when predicting extreme values. This bias in regression training data is similar to the problem of minority classes in classification. In classification, however, the issue is well documented and is an ongoing area of research. In this paper, we consider stratified sampling and boosting as solutions to this bias problem and evaluate them on this outcome prediction problem and on two other datasets. Both approaches produce some improvement, with boosting showing the most promise.
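As an illustration of the stratified sampling idea mentioned in the abstract, the sketch below rebalances a regression training set so that rare (extreme) target values are represented as often as common ones. It is not the authors' procedure: the function name `stratified_resample`, the use of equal-width bins over the target range, and oversampling with replacement up to the size of the largest bin are all assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_resample(xs, ys, n_bins=5, per_bin=None, seed=0):
    """Resample (xs, ys) so every occupied target bin contributes equally.

    Hypothetical helper: bins are equal-width over [min(ys), max(ys)], and
    each occupied bin is sampled with replacement.
    """
    rng = random.Random(seed)
    lo, hi = min(ys), max(ys)
    # Fall back to a width of 1.0 when all targets are identical.
    width = (hi - lo) / n_bins or 1.0

    # Group example indices by the bin their target value falls into.
    bins = defaultdict(list)
    for i, y in enumerate(ys):
        b = min(int((y - lo) / width), n_bins - 1)  # clamp max(ys) into last bin
        bins[b].append(i)

    # By default, bring every bin up to the size of the most populated one.
    if per_bin is None:
        per_bin = max(len(members) for members in bins.values())

    picked = []  # (example index, bin id) pairs
    for b in sorted(bins):
        # Sampling with replacement oversamples the sparse, extreme bins.
        picked.extend((i, b) for i in rng.choices(bins[b], k=per_bin))
    rng.shuffle(picked)

    new_xs = [xs[i] for i, _ in picked]
    new_ys = [ys[i] for i, _ in picked]
    new_bins = [b for _, b in picked]
    return new_xs, new_ys, new_bins
```

A regressor trained on the resampled data sees extreme targets as often as central ones, which is the effect the abstract attributes to stratified sampling. Because sparse bins are filled by repeating examples, this sketch can encourage overfitting to the few extreme cases available; the bin count and per-bin sample size would need tuning on real data.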