Tsamardinos Ioannis, Borboudakis Giorgos, Katsogridakis Pavlos, Pratikakis Polyvios, Christophides Vassilis
1Computer Science Department, University of Crete, Heraklion, Greece.
Gnosis Data Analysis PC, Heraklion, Greece.
Mach Learn. 2019;108(2):149-202. doi: 10.1007/s10994-018-5748-7. Epub 2018 Aug 7.
We present the (PFBP) algorithm for (FS) for Big Data of high dimensionality. PFBP partitions the data matrix both in terms of rows as well as columns. By employing the concepts of -values of conditional independence tests and meta-analysis techniques, PFBP relies only on computations local to a partition while minimizing communication costs, thus massively parallelizing computations. Similar techniques for combining local computations are also employed to create the final predictive model. PFBP employs asymptotically sound heuristics to make early, approximate decisions, such as of features from consideration in subsequent iterations, of consideration of features within the same iteration, or of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions representable by a causal network (Bayesian network or maximal ancestral graph). Empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size, linear scalability with respect to the number of features and processing cores. An extensive comparative evaluation also demonstrates the effectiveness of PFBP against other algorithms in its class. The heuristics presented are general and could potentially be employed to other greedy-type of FS algorithms. An application on simulated Single Nucleotide Polymorphism (SNP) data with 500K samples is provided as a use case.
我们提出了用于高维大数据的特征选择(FS)的分区特征选择(PFBP)算法。PFBP 对数据矩阵按行和列进行分区。通过运用条件独立性检验的 p 值概念和元分析技术,PFBP 仅依赖于分区内的局部计算,同时将通信成本降至最低,从而大规模并行化计算。用于组合局部计算的类似技术也被用于创建最终的预测模型。PFBP 采用渐近合理的启发式方法来做出早期的近似决策,例如在后续迭代中从考虑范围中排除特征、在同一迭代中排除特征的考虑或在每次迭代中排除获胜者。PFBP 为可由因果网络(贝叶斯网络或最大祖先图)表示的数据分布提供渐近最优性保证。实证分析证实了该算法随着样本量增加的超线性加速、相对于特征数量和处理核心的线性可扩展性。广泛的比较评估还证明了 PFBP 相对于同类其他算法的有效性。所提出的启发式方法具有通用性,可能会被应用于其他贪婪型的 FS 算法。作为一个用例,给出了在具有 50 万个样本的模拟单核苷酸多态性(SNP)数据上的应用。