Malley J D, Kruppa J, Dasgupta A, Malley K G, Ziegler A
Center for Computational Bioscience, Center for Information Technology, National Institutes of Health, Bethesda, USA.
Methods Inf Med. 2012;51(1):74-81. doi: 10.3414/ME00-01-0052. Epub 2011 Sep 14.
Most machine learning approaches only provide a classification for binary responses. However, probabilities are required for risk estimation using individual patient characteristics. It has been shown recently that every statistical learning machine known to be consistent for a nonparametric regression problem is a probability machine that is provably consistent for this estimation problem.
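The step from regression to probability estimation can be written out in one line. The notation below is an assumption introduced here for illustration and does not come from the abstract: $Y \in \{0,1\}$ is the binary response, $X$ the patient characteristics, and $\hat{f}_n$ a regression estimate fitted on $n$ observations.

```latex
\begin{align*}
\mathbb{E}[Y \mid X = x]
  &= 1 \cdot P(Y = 1 \mid X = x) + 0 \cdot P(Y = 0 \mid X = x) \\
  &= P(Y = 1 \mid X = x).
\end{align*}
% Hence, if \hat{f}_n(x) converges to E[Y | X = x] (in whatever sense the
% machine is known to be consistent for regression), it converges in the
% same sense to the individual probability P(Y = 1 | X = x).
```

Read this way, running a consistent regression machine on the 0/1 coded response yields a probability estimate rather than only a class label.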
The aim of this paper is to show how random forests and nearest neighbors can be used for consistent estimation of individual probabilities.
Two random forest algorithms and two nearest neighbor algorithms are described in detail for estimation of individual probabilities. We discuss the consistency of random forests, nearest neighbors and other learning machines. We conduct a simulation study to illustrate the validity of the methods. We exemplify the algorithms by analyzing two well-known data sets, one on the diagnosis of appendicitis and one on the diagnosis of diabetes in Pima Indians.
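As a rough illustration of the nearest neighbor idea, the sketch below estimates an individual probability as the mean 0/1 response among the k nearest training points, and compares it with the known truth in a small simulation. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the logistic data-generating model, the sample size and the choice k = 100 are all hypothetical and chosen only for this example.

```python
import math
import random


def knn_probability(train_x, train_y, query, k):
    """Estimate P(Y=1 | X=query) as the mean 0/1 response
    of the k training points nearest to the query."""
    # Rank training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    # Averaging 0/1 responses is k-NN regression on the binary outcome.
    return sum(train_y[i] for i in nearest) / k


def true_probability(x):
    # Hypothetical logistic model, used only to simulate data (assumption).
    return 1.0 / (1.0 + math.exp(-(2.0 * x[0] - 1.0 * x[1])))


def simulate(n, rng):
    """Draw n feature pairs uniformly and 0/1 responses from the model above."""
    xs = [(rng.uniform(-2, 2), rng.uniform(-2, 2)) for _ in range(n)]
    ys = [1 if rng.random() < true_probability(x) else 0 for x in xs]
    return xs, ys


rng = random.Random(0)  # fixed seed so the run is reproducible
train_x, train_y = simulate(2000, rng)
query = (1.0, 0.0)
estimate = knn_probability(train_x, train_y, query, k=100)
print("estimated:", round(estimate, 2),
      "true:", round(true_probability(query), 2))
```

The estimate is a proportion between 0 and 1 by construction. In this sketch k is fixed by hand; how k should be chosen or grow with the sample size is outside what the example shows.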
Simulations demonstrate the validity of the method. With the real data applications, we show the accuracy and practicality of this approach. We provide sample code from R packages in which probability estimation is already available, so all calculations can be performed using existing software.
Random forest algorithms as well as nearest neighbor approaches are valid machine learning methods for estimating individual probabilities for binary responses. Free implementations are available in R and may be used for applications.