Saul Michelle, Dinu Valentin
College of Health Solutions, Arizona State University, Tempe, AZ 85287-9020, USA.
Caris Life Sciences, Tempe, AZ 85281, USA.
Bioinformatics. 2021 Oct 25;37(20):3626-3631. doi: 10.1093/bioinformatics/btab387.
When designing prediction models built with many features and relatively small sample sizes, feature selection methods often overfit training data, leading to selection of irrelevant features. One way to potentially mitigate overfitting is to incorporate domain knowledge during feature selection. Here, a feature ranking algorithm called 'Family Rank' is presented in which features are ranked based on a combination of graphical domain knowledge and feature scores computed from empirical data.
A simulated dataset is used to demonstrate a scenario in which Family Rank outperforms other state-of-the-art graph-based ranking algorithms, decreasing the sample size needed to detect true predictors by 2- to 3-fold. An example from oncology is then used to explore a real-world application of Family Rank.
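To make the general idea of combining graphical domain knowledge with empirical feature scores concrete, here is a minimal Python sketch. It is not the published Family Rank algorithm, whose details are not given in this abstract. The function name `graph_adjusted_ranking`, the `neighbor_weight` parameter, the additive scoring rule and the toy data are all assumptions made purely for illustration.

```python
from collections import defaultdict


def graph_adjusted_ranking(scores, edges, neighbor_weight=0.5):
    """Rank features by empirical score plus a weighted share of neighbour scores.

    Hypothetical illustration only; this is NOT the Family Rank algorithm.
    scores: dict mapping feature name -> empirical score (e.g. from training data)
    edges:  iterable of (feature_a, feature_b, weight) undirected domain-knowledge links
    """
    # Build an undirected adjacency list from the domain-knowledge graph.
    neighbours = defaultdict(list)
    for a, b, w in edges:
        neighbours[a].append((b, w))
        neighbours[b].append((a, w))

    adjusted = {}
    for feature, score in scores.items():
        # Support from connected features, scaled by edge weight.
        support = sum(w * scores.get(other, 0.0) for other, w in neighbours[feature])
        adjusted[feature] = score + neighbor_weight * support

    # Highest adjusted score first.
    return sorted(adjusted, key=adjusted.get, reverse=True)


# Toy data (invented): B scores lower than C empirically,
# but B is strongly linked to the top-scoring feature A.
scores = {"A": 0.9, "B": 0.5, "C": 0.6, "D": 0.1}
edges = [("A", "B", 1.0), ("B", "D", 0.2)]
ranking = graph_adjusted_ranking(scores, edges)
print(ranking)
```

In this toy case the empirical scores alone would order the features A, C, B, D. The link between A and B lifts B above the isolated feature C, which is the kind of effect a graph-informed ranking is meant to produce.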
An implementation of Family Rank is freely available at https://cran.r-project.org/package=FamilyRank.
Supplementary data are available at Bioinformatics online.