Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany.
BMC Bioinformatics. 2021 Feb 18;22(1):74. doi: 10.1186/s12859-021-04011-z.
One component of precision medicine is the construction of prediction models with the highest possible predictive ability, e.g. to enable individual risk prediction. In genetic epidemiology, complex diseases such as coronary artery disease, rheumatoid arthritis, and type 2 diabetes have a polygenic basis, and a common assumption is that biological and genetic features affect the outcome under consideration via interactions. For omics data, standard approaches such as generalized linear models may be suboptimal, and machine learning methods are appealing for making individual predictions. However, most of these algorithms focus on main or marginal effects of the single features in a dataset. At the same time, the detection of interacting features is an active area of research in genetic epidemiology. One large class of algorithms for detecting interacting features is based on multifactor dimensionality reduction (MDR). Here, we further develop the model-based MDR (MB-MDR), a powerful extension of the original MDR algorithm, to enable interaction-empowered individual prediction.
Using a comprehensive simulation study, we show that our new algorithm (median AUC: 0.66) can use information hidden in interactions and outperforms two other state-of-the-art algorithms, namely Random Forest (median AUC: 0.54) and Elastic Net (median AUC: 0.50), when interactions are present in a scenario of two pairs of two features with small effects. The performance of the three algorithms is comparable if no interactions are present. Further, we show that our new algorithm is applicable to real data by comparing the performance of the three algorithms on a dataset of rheumatoid arthritis cases and healthy controls. Because our new algorithm is applicable not only to biological/genetic data but to all datasets with discrete features, it may have practical implications in other research fields where interactions between features have to be considered as well. We have made our method available as an R package (https://github.com/imbs-hl/MBMDRClassifieR).
The explicit use of interactions between features can improve prediction performance and should therefore be included in further attempts to move precision medicine forward.
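To make the idea of predicting from feature interactions more concrete, the following is a minimal sketch of the cell-labelling step of the original MDR, turned into a simple risk score and evaluated by AUC. It is not the MB-MDR classifier described above and not the API of the MBMDRClassifieR package. The toy data, the effect sizes (0.7 and 0.3), and all function names (`simulate`, `fit_mdr_cells`, `predict`, `auc`) are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def simulate(n, seed=0):
    """Hypothetical toy data: two discrete features coded 0/1/2 and a binary outcome
    whose risk depends on the combination of both features."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        a, b = rng.randrange(3), rng.randrange(3)
        # Assumed effect: risk is elevated when exactly one feature is nonzero.
        p = 0.7 if (a > 0) != (b > 0) else 0.3
        rows.append((a, b, 1 if rng.random() < p else 0))
    return rows


def fit_mdr_cells(train):
    """Classic MDR step: pool each feature combination (cell) into high or low risk."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [controls, cases]
    for a, b, y in train:
        counts[(a, b)][y] += 1
    overall = sum(y for _, _, y in train) / len(train)
    model = {}
    for cell, (n_controls, n_cases) in counts.items():
        prop = n_cases / (n_controls + n_cases)
        # A cell is high risk if its case proportion exceeds the overall case proportion.
        model[cell] = (prop, prop > overall)
    return model, overall


def predict(model, overall, a, b):
    """Return the case proportion of the cell as a risk score.
    Cells not seen in training fall back to the overall case proportion."""
    return model.get((a, b), (overall, False))[0]


def auc(scores, labels):
    """AUC as the Mann-Whitney statistic; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    train, test = simulate(2000, seed=1), simulate(1000, seed=2)
    model, overall = fit_mdr_cells(train)
    scores = [predict(model, overall, a, b) for a, b, _ in test]
    labels = [y for _, _, y in test]
    print("AUC on held-out toy data:", round(auc(scores, labels), 3))
```

As far as the MB-MDR literature describes it, MB-MDR replaces the simple threshold rule used here with an association test per cell and allows a third "no evidence" category. The sketch only shows why a cell-based score can pick up signal that sits in feature combinations.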