Health Systems Management, Rush University Medical Center, 1700 W. Van Buren Street, Chicago, IL 60612, USA.
Comput Biol Med. 2010 Mar;40(3):288-99. doi: 10.1016/j.compbiomed.2009.12.005. Epub 2010 Jan 25.
A relabeling algorithm for retrieval of noisy instances with binary outcomes is presented. The algorithm iteratively retrieves, selects, and relabels data instances (i.e., transforms the decision space) to improve prediction quality. It emphasizes knowledge generalization and confidence rather than classification accuracy. A confidence index was designed that incorporates classification accuracy, prediction error, impurities in the relabeled dataset, and cluster purities. The proposed approach is illustrated with a binary outcome dataset and was successfully tested on four standard benchmark datasets from the UCI repository as well as on bladder cancer immunotherapy data. For each application, a subset of the most stable instances (7% to 51% of the sample) with high confidence (between 64% and 99.44%) was identified, along with the noisiest instances. Domain experts and the extracted knowledge validated the relabeled instances and the corresponding confidence indexes. With some modifications, the relabeling algorithm can be applied to other medical, industrial, and service domains.
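The abstract does not spell out the relabeling procedure or the formula for the confidence index, so the following is only a generic sketch of iterative relabeling for binary outcomes, not the authors' algorithm. It assumes a k-nearest-neighbour majority vote as a stand-in for the retrieval and selection steps; the names `knn_majority` and `relabel`, the parameter `k`, and the toy data are all hypothetical. Instances whose label is never changed are treated as stable, and instances that are flipped are treated as candidate noise.

```python
from collections import Counter


def knn_majority(points, labels, i, k):
    """Majority label among the k nearest neighbours of point i (excluding i)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
        for j in range(len(points))
        if j != i
    )
    votes = Counter(labels[j] for _, j in dists[:k])
    return votes.most_common(1)[0][0]


def relabel(points, labels, k=3, max_iter=10):
    """Relabel instances that disagree with their neighbourhood until stable.

    Returns the final labels and, per instance, how often its label flipped.
    This is an illustrative stand-in, not the published method.
    """
    labels = list(labels)
    flips = [0] * len(labels)
    for _ in range(max_iter):
        # Compute all proposals from the current labels before applying any.
        proposed = [knn_majority(points, labels, i, k) for i in range(len(points))]
        changed = [i for i in range(len(labels)) if proposed[i] != labels[i]]
        if not changed:
            break  # no instance disagrees with its neighbourhood any more
        for i in changed:
            labels[i] = proposed[i]
            flips[i] += 1
    return labels, flips


# Toy data (assumed): two well-separated clusters, one mislabeled point in each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
labels = [0, 0, 1, 0, 1, 1, 1, 0]

new_labels, flips = relabel(points, labels, k=3)
stable = [i for i, f in enumerate(flips) if f == 0]
noisy = [i for i, f in enumerate(flips) if f > 0]
print(new_labels, stable, noisy)
```

An odd `k` avoids tied votes for binary labels. A confidence index like the one described in the abstract would combine several quantities (accuracy, prediction error, impurity, cluster purity); this sketch only tracks label flips and makes no attempt to reproduce that index.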