Oka Souichi, Takefuji Yoshiyasu
SciencePark Corporation, 3-24-9 Iriya-Nishi Zama-shi, Kanagawa 252-0029, Japan.
Faculty of Data Science, Musashino University, 3-3-3 Ariake Koto-ku, Tokyo 135-8181, Japan.
Sci Total Environ. 2025 Jul 1;984:179714. doi: 10.1016/j.scitotenv.2025.179714. Epub 2025 May 23.
Song et al. (2024), in "Prediction of PFAS bioaccumulation in different plant tissues with machine learning models based on molecular fingerprints," used XGBoost together with SHapley Additive exPlanations (SHAP) to predict the bioaccumulation of per- and polyfluoroalkyl substances (PFAS) and reported high predictive accuracy. This commentary critically examines their interpretation of feature importance, because high predictive accuracy does not guarantee reliable feature importance. Both approaches are known to be biased: XGBoost's importance scores can overemphasize features used in early splits, and SHAP inherits the biases of the underlying model. Furthermore, the high dimensionality and potential collinearity of molecular fingerprints complicate SHAP interpretation, increasing the risk of overfitting and compromising the stability of SHAP values. As a general example, we conducted an independent simulation on a publicly available dataset of US industrial facilities and environmental compliance, which showed significant discrepancies between the feature importance rankings from XGBoost and those from robust statistical tests. This commentary therefore advocates feature selection based on robust statistical measures reported with p-values, including Spearman's rho, Kendall's tau, Goodman-Kruskal's gamma, Somers' D, and Hoeffding's D. These non-parametric methods do not depend on the assumptions of a specific model and rely on data ranks, so they are better suited to capturing complex relationships in high-dimensional data and provide a more reliable foundation for future PFAS bioaccumulation research.
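The rank-based screening advocated above can be illustrated with a minimal sketch. It is not the authors' simulation: the data are synthetic, the feature names `signal` and `noise` are hypothetical, and the p-value comes from a simple permutation test chosen here to keep the example dependency-free (the commentary does not state how its p-values were computed). Only Spearman's rho is shown; the other measures would follow the same pattern.

```python
import math
import random


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def permutation_p_value(x, y, n_perm=1000, seed=0):
    """Two-sided permutation p-value for Spearman's rho (illustrative choice)."""
    rng = random.Random(seed)
    rank_x, rank_y = ranks(x), ranks(y)
    observed = abs(pearson(rank_x, rank_y))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(rank_y)  # break any association between x and y
        if abs(pearson(rank_x, rank_y)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Synthetic data (assumption): the target depends monotonically but
# non-linearly on "signal" and not at all on "noise".
rng = random.Random(42)
n = 200
signal = [rng.gauss(0, 1) for _ in range(n)]
noise = [rng.gauss(0, 1) for _ in range(n)]
target = [math.exp(s) + rng.gauss(0, 0.2) for s in signal]

results = {}
for name, feature in {"signal": signal, "noise": noise}.items():
    results[name] = (spearman_rho(feature, target),
                     permutation_p_value(feature, target))

# Rank features by absolute rho and report the p-value next to each.
for name, (rho, p) in sorted(results.items(), key=lambda kv: -abs(kv[1][0])):
    print(f"{name}: rho={rho:+.3f}, p={p:.4f}")
```

Because the statistic uses only ranks, the exponential relationship between `signal` and `target` does not need to be modelled explicitly. In practice a ranking obtained this way could be compared against a model's importance scores to see where the two disagree.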