Raynal Louis, Hoffmann Till, Onnela Jukka-Pekka
Department of Biostatistics, T.H. Chan School of Public Health, Harvard University.
J Comput Graph Stat. 2023;32(3):1109-1118. doi: 10.1080/10618600.2022.2151453. Epub 2023 Jan 20.
Selecting a small set of informative features from a large number of possibly noisy candidates is a challenging problem with many applications in machine learning and approximate Bayesian computation. In practice, the cost of computing informative features also needs to be considered. This is particularly important for networks because the computational costs of individual features can span several orders of magnitude. We addressed this issue for the network model selection problem using two approaches. First, we adapted nine feature selection methods to account for the cost of features. We show for two classes of network models that the cost can be reduced by two orders of magnitude without considerably affecting classification accuracy (proportion of correctly identified models). Second, we selected features using pilot simulations with smaller networks. This approach reduced the computational cost by a factor of 50 without affecting classification accuracy. To demonstrate the utility of our approach, we applied it to three different yeast protein interaction networks and identified the best-fitting duplication divergence model. Supplemental materials, including computer code to reproduce our results, are available online.
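The cost-aware selection idea in the abstract can be illustrated with a minimal sketch. The paper adapts nine established feature selection methods to account for feature cost; the greedy knapsack-style heuristic below is not one of those methods, only a hypothetical illustration of the trade-off, with made-up scores, costs, and budget.

```python
# Hypothetical sketch of cost-aware feature selection: rank candidate
# features by an informativeness-to-cost ratio and greedily add them
# until a compute budget is exhausted. All numbers are illustrative.

def select_features(scores, costs, budget):
    """Greedily pick feature indices with the best score/cost ratio,
    skipping any feature whose cost would exceed the remaining budget."""
    order = sorted(range(len(scores)),
                   key=lambda i: scores[i] / costs[i],
                   reverse=True)
    selected, spent = [], 0.0
    for i in order:
        if spent + costs[i] <= budget:
            selected.append(i)
            spent += costs[i]
    return selected, spent

# Toy example: a cheap, informative feature beats an expensive one of
# similar informativeness, mirroring the abstract's point that network
# feature costs span orders of magnitude.
scores = [0.9, 0.8, 0.85, 0.4]   # e.g. univariate relevance scores
costs  = [1.0, 100.0, 5.0, 0.5]  # e.g. seconds to compute per network
chosen, spent = select_features(scores, costs, budget=10.0)
print(chosen, spent)  # the costly feature 1 is excluded
```

Under this heuristic the expensive second feature is dropped despite its high score, keeping total cost far below what computing all features would require.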