Orlenko Alena, Moore Jason H
Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA.
BioData Min. 2021 Jan 29;14(1):9. doi: 10.1186/s13040-021-00243-0.
Non-additive interactions among genes are frequently associated with a number of phenotypes, including known complex diseases such as Alzheimer's disease, diabetes, and cardiovascular disease. Detecting these interactions requires careful selection of analytical methods, because some machine learning algorithms are unable, or underpowered, to detect or model feature interactions that exhibit non-additivity. The Random Forest method is often employed in these efforts because it can detect and model non-additive interactions. In addition, Random Forest has a built-in ability to estimate feature importance scores, which allows the model to be interpreted in terms of the rank order and effect size of each feature's association with the outcome. This characteristic is very important for epidemiological and clinical studies, where the results of predictive modeling could be used to define the future direction of research efforts. Alternative ways to interpret the model are the permutation feature importance metric, which permutes a feature's values and reports that feature's contribution as the resulting decrease in the model's performance, and Shapley additive explanations, which take a cooperative game theory approach. Currently, it is unclear which Random Forest feature importance metric provides a superior estimation of the true informative contribution of features in genetic association analysis.
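To make the permutation approach concrete, the following is a minimal sketch of permutation feature importance and is not the implementation used in the study. It assumes a toy dataset in which the outcome is the XOR of two binary features (a purely non-additive interaction) plus one noise feature. The function `xor_model` is a hypothetical stand-in for the prediction function of a fitted model such as a Random Forest, and all names and parameters here are illustrative.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which predict(row) equals the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle only column j, which breaks its link to the outcome
            # while leaving every other feature untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data (assumption): features 0 and 1 interact non-additively via XOR,
# feature 2 is pure noise.
data_rng = random.Random(42)
X = [[data_rng.randint(0, 1) for _ in range(3)] for _ in range(500)]
y = [row[0] ^ row[1] for row in X]


def xor_model(row):
    """Hypothetical stand-in for a fitted model's prediction function."""
    return row[0] ^ row[1]


importances = permutation_importance(xor_model, X, y)
for j, score in enumerate(importances):
    print(f"feature {j}: mean accuracy drop = {score:.3f}")
```

In this sketch the two interacting features each receive a large importance, because shuffling either one destroys the XOR signal, while the noise feature receives zero. Neither interacting feature has a marginal effect on its own, which is the situation in which importance metrics may disagree.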
To address this issue, and to improve the interpretability of Random Forest predictions, we compared different methods for feature importance estimation in real and simulated datasets with non-additive interactions. We detected a discrepancy among the metrics on the real-world datasets, and further established that the permutation feature importance metric provides more precise feature importance rank estimation for the simulated datasets with non-additive interactions.
By analyzing both real and simulated data, we established that the permutation feature importance metric provides more precise feature importance rank estimation in the presence of non-additive interactions.