Queen Katelyn J, Barrett Malcolm, Millstein Joshua
Department of Population and Public Health Sciences, University of Southern California, Los Angeles, California, United States.
Department of Health Policy, Stanford University, Stanford, California, United States.
PeerJ. 2025 Jan 27;13:e18580. doi: 10.7717/peerj.18580. eCollection 2025.
As data sets increase in size and complexity with advancing technology, flexible and interpretable data reduction methods that quantify information preservation become increasingly important.
Super Partition is a large-scale approximation of the original Partition data reduction algorithm that lets the user specify the minimum amount of information captured for each input feature. In an initial step, Genie, a fast hierarchical clustering algorithm, forms a super-partition of the features; Partition is then applied to each subset separately, which keeps the computation tractable. Applications to high-dimensional data sets show scalability to hundreds of thousands of features with reasonable computation times.
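The two-stage idea described above can be illustrated with a minimal Python sketch. It is not the partition package's implementation, and every name in it is hypothetical. Two substitutions are assumed for brevity: a single-pass "leader" grouping by absolute correlation stands in for Genie, and the information captured for a feature is taken to be its squared correlation with the mean of its cluster. The toy data are built from sine and cosine waves so that two blocks of three features are highly correlated within each block and nearly uncorrelated across blocks.

```python
import math

def avg(values):
    return sum(values) / len(values)

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = avg(x), avg(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)

def summary(cols, members):
    """Reduced variable for a cluster: the row-wise mean of its member features."""
    return [avg(row) for row in zip(*(cols[j] for j in members))]

def min_information(cols, members):
    """Assumed metric: smallest squared correlation between a member and the summary."""
    s = summary(cols, members)
    return min(pearson(cols[j], s) ** 2 for j in members)

def coarse_groups(cols, cutoff):
    """Stand-in for the Genie step: join the first group whose leader has |r| >= cutoff."""
    groups = []
    for j in range(len(cols)):
        for g in groups:
            if abs(pearson(cols[j], cols[g[0]])) >= cutoff:
                g.append(j)
                break
        else:
            groups.append([j])
    return groups

def reduce_group(cols, group, threshold):
    """Greedy agglomeration within one group; a merge is kept only if every
    member of the merged cluster retains at least `threshold` information."""
    clusters = [[j] for j in group]
    rejected = set()
    while True:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                key = (tuple(clusters[a]), tuple(clusters[b]))
                if key in rejected:
                    continue
                r = pearson(summary(cols, clusters[a]), summary(cols, clusters[b]))
                if best is None or r > best[0]:
                    best = (r, a, b, key)
        if best is None:  # no untried pair left
            return clusters
        _, a, b, key = best
        merged = clusters[a] + clusters[b]
        if min_information(cols, merged) >= threshold:
            clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
        else:
            rejected.add(key)

def super_partition_sketch(cols, threshold=0.8, coarse_cutoff=0.3):
    """Coarse grouping first, then threshold-constrained reduction per group."""
    result = []
    for group in coarse_groups(cols, coarse_cutoff):
        result.extend(reduce_group(cols, group, threshold))
    return result

# Toy data: 40 observations, two blocks of three correlated features, one unrelated feature.
n = 40
def wave(k, fn=math.sin):
    return [fn(2 * math.pi * k * i / n) for i in range(n)]

def perturb(base, k):
    return [u + 0.1 * v for u, v in zip(base, wave(k))]

cols = ([perturb(wave(1), k) for k in (3, 5, 7)]
        + [perturb(wave(1, math.cos), k) for k in (3, 5, 7)]
        + [wave(9)])
clusters = super_partition_sketch(cols)
print(sorted(sorted(c) for c in clusters))
```

Because the expensive pairwise search runs only inside each coarse group, its cost depends on the size of the largest group rather than on the total number of features, which is the source of the scalability the abstract describes.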
Super Partition is a new function within the partition R package, available on the CRAN repository (https://cran.r-project.org/web/packages/partition/index.html).