Afshar Shiva, Chen Yinghan, Han Shizhong, Lin Ying
Department of Neurology, Emory University, Atlanta, GA, 30322, USA.
Department of Mathematics and Statistics, University of Nevada, Reno, NV, 89557, USA.
IISE Trans. 2024 Dec 4. doi: 10.1080/24725854.2024.2417258.
Combining multiple predictors obtained from distributed data sources into an accurate meta-learner is a promising way to improve performance in many prediction problems. Because the accuracy of each predictor is usually unknown, integrating the predictors to achieve better performance is challenging. Conventional ensemble learning methods assess the accuracy of predictors using extensive labeled data. In practical applications, however, such labeled data can be difficult to acquire. Furthermore, the predictors may be highly correlated, particularly when similar data sources or machine learning algorithms were used to train them. In response to these challenges, this paper introduces a novel structured unsupervised ensemble learning (SUEL) model that exploits the dependency among a set of predictors with continuous predictive scores, ranks the predictors without labeled data, and combines them into a weighted ensemble score. Two novel correlation-based decomposition algorithms are further proposed to estimate the SUEL model: a constrained quadratic optimization approach (SUEL.CQO) and a matrix-factorization-based approach (SUEL.MF). The efficacy of the proposed methods is assessed through both simulation studies and a real-world application to risk gene discovery. The results demonstrate that the proposed methods can efficiently integrate dependent predictors into an ensemble model without the need for ground truth data.
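The general idea of ranking and weighting predictors from their correlation matrix alone can be illustrated with a minimal sketch. This is not the SUEL.CQO or SUEL.MF algorithm described in the abstract. It assumes a much simpler model in which each standardized score equals a loading a_i times a shared latent signal plus independent noise, so the off-diagonal correlations are approximately a_i * a_j. It therefore ignores the dependency structure that SUEL is designed to exploit. The function names `rank_one_offdiag` and `unsupervised_ensemble` are hypothetical, and the sketch assumes most predictors are positively related to the latent signal.

```python
import numpy as np

def rank_one_offdiag(R, n_iter=200):
    """Fit R[i, j] ~= a[i] * a[j] on the off-diagonal entries of R.

    Iterated principal-axis step: replace the diagonal with the current
    a**2, then take the leading eigenpair again.
    """
    M = np.array(R, dtype=float)  # working copy; off-diagonals never change
    for _ in range(n_iter):
        vals, vecs = np.linalg.eigh(M)  # eigenvalues in ascending order
        a = vecs[:, -1] * np.sqrt(max(vals[-1], 0.0))
        np.fill_diagonal(M, a ** 2)
    # Resolve the sign ambiguity of the eigenvector (assumption: most
    # predictors are positively correlated with the latent signal).
    if a.sum() < 0:
        a = -a
    return a

def unsupervised_ensemble(scores):
    """Rank and combine predictors without labels (simplified model only).

    scores: array of shape (n_samples, n_predictors), continuous scores.
    Returns the ensemble score and the estimated loadings a.
    """
    scores = np.asarray(scores, dtype=float)
    Z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    R = np.corrcoef(Z, rowvar=False)
    a = rank_one_offdiag(R)
    # Under the independent-noise assumption, the least-squares weights
    # are proportional to a / (1 - a**2); clip to avoid division by zero.
    a_clipped = np.clip(a, -0.99, 0.99)
    w = a_clipped / (1.0 - a_clipped ** 2)
    w = w / np.abs(w).sum()
    return Z @ w, a
```

A larger estimated loading suggests a more accurate predictor, so sorting the returned loadings gives a label-free ranking under this simplified model. When predictors share data sources or training algorithms, the rank-one assumption is violated, which is the situation the abstract's structured model targets.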