Lin Jau-Huei, Haug Peter J
Department of Biomedical Informatics, University of Utah, 26 South 2000 East Room 5775 HSEB, Salt Lake City, UT 84112-5750, USA.
J Biomed Inform. 2008 Feb;41(1):1-14. doi: 10.1016/j.jbi.2007.06.001. Epub 2007 Jun 9.
When machine learning algorithms are applied to data collected during the course of clinical care, it is generally accepted that the data have not been collected consistently. The absence of expected data elements is common, and the mechanism through which a data element is missing often involves the clinical relevance of that data element in a specific patient. Therefore, the absence of data may have information value of its own. In the process of designing an application intended to support a medical problem list, we studied whether the "missingness" of clinical data can provide useful information in building prediction models. In this study, we experimented with four methods of treating missing values in a clinical data set; two of them explicitly model the absence or "missingness" of data. Each of the resulting data sets was used to build four different kinds of Bayesian classifiers: a naive Bayes structure, a human-composed network structure, and two networks based on structural learning algorithms. We compared the performance of the groups with and without explicit models of missingness using the area under the ROC curve. The results showed that in most cases the classifiers trained using the explicit missing value treatments performed better. This suggests that information may exist in "missingness" itself. Thus, when designing a decision support system, we suggest explicitly representing the presence or absence of data in the underlying logic.
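As an illustration of what "explicitly modeling missingness" can look like, the following is a minimal sketch and not the authors' implementation. It assumes a single discretized lab result, gives an absent value its own state ("missing"), and fits a small categorical naive Bayes classifier with Laplace smoothing. The toy records, the state names, and the helper functions (`encode_explicit`, `train_naive_bayes`, `predict_proba`, `auc`) are all hypothetical and chosen only for this example; the abstract does not specify the four treatments or the data used in the study.

```python
import math
from collections import Counter, defaultdict

MISSING = "missing"  # hypothetical name for the explicit "absent" state


def encode_explicit(value):
    """Map an absent observation (None) to its own discrete state."""
    return MISSING if value is None else value


def train_naive_bayes(rows, labels, alpha=1.0):
    """Fit a categorical naive Bayes model with Laplace smoothing."""
    n_features = len(rows[0])
    class_counts = Counter(labels)
    # counts[j][c][v] = number of class-c rows where feature j has state v
    counts = [defaultdict(Counter) for _ in range(n_features)]
    states = [set() for _ in range(n_features)]
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            counts[j][c][v] += 1
            states[j].add(v)
    return {"class_counts": class_counts, "counts": counts,
            "states": states, "alpha": alpha}


def predict_proba(model, row, positive=1):
    """Posterior probability of the positive class for one encoded row."""
    class_counts = model["class_counts"]
    total = sum(class_counts.values())
    log_scores = {}
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)
        for j, v in enumerate(row):
            k = len(model["states"][j])
            smoothed = model["counts"][j][c][v] + model["alpha"]
            score += math.log(smoothed / (n_c + model["alpha"] * k))
        log_scores[c] = score
    # Normalize in log space for numerical stability.
    top = max(log_scores.values())
    norm = sum(math.exp(s - top) for s in log_scores.values())
    return math.exp(log_scores[positive] - top) / norm


def auc(scores, labels, positive=1):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy records: one lab result per patient, None = never ordered.
raw = ["high", "high", "normal", "high", None,          # disease present
       None, None, None, "normal", "normal", "high"]    # disease absent
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

rows = [[encode_explicit(v)] for v in raw]
model = train_naive_bayes(rows, labels)

prior = labels.count(1) / len(labels)
p_missing = predict_proba(model, [MISSING])
scores = [predict_proba(model, r) for r in rows]
train_auc = auc(scores, labels)

print(f"prior P(disease)          = {prior:.3f}")
print(f"P(disease | lab missing)  = {p_missing:.3f}")
print(f"training AUC              = {train_auc:.3f}")
```

In this toy data the lab test is ordered less often for patients without the disease, so the "missing" state lowers the estimated probability of disease below the prior. A treatment that discards or imputes the absent value would not expose that signal to the classifier. The AUC here is computed on the training records only, so it shows the mechanics of the comparison rather than any estimate of real performance.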