School of Medicine, Institute of Biomedicine, University of Eastern Finland, Kuopio, Finland.
Bioinformatics. 2022 Sep 16;38(Suppl_2):ii20-ii26. doi: 10.1093/bioinformatics/btac463.
In modern translational research, biomarker development relies heavily on omics technologies, but implementations based on basic data mining algorithms frequently produce false positives. The Non-dominated Sorting Genetic Algorithm II (NSGA2) is an extremely effective algorithm for biomarker discovery but has rarely been evaluated on large-scale datasets. Exploration of the feature search space is key to NSGA2's success, but in specific cases NSGA2 explores the space of possible feature combinations only shallowly, which can lead to models with low predictive performance.
We propose two improved NSGA2 algorithms for finding subsets of biomarkers that exhibit different trade-offs between accuracy and number of features. Their performance is evaluated on gene expression data from breast cancer patients and compared with NSGA2 and LASSO. The benchmarking dataset includes internal and external validation sets. The results show that the proposed algorithms generate a better approximation of the optimal trade-offs between accuracy and set size. Moreover, their validation and test accuracies are better than those obtained with NSGA2 and LASSO. Remarkably, the GA-based methods provide biomarkers that achieve very high prediction accuracy (>80%) with a small number of features (<10), representing a valid alternative to known biomarker models such as PAM50 and MammaPrint.
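The trade-off between accuracy and set size can be illustrated with a minimal Python sketch of Pareto dominance, the relation that underlies non-dominated sorting. This is not the authors' implementation: the function names `dominates` and `pareto_front` are hypothetical, and the candidate values are invented for illustration, not results from the paper.

```python
from typing import List, Tuple

# A candidate biomarker set summarised as (accuracy, number of features).
Candidate = Tuple[float, int]


def dominates(a: Candidate, b: Candidate) -> bool:
    """Return True if a is no worse than b on both objectives
    (higher accuracy, fewer features) and strictly better on one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Illustrative values only (assumed, not taken from the paper).
candidates = [(0.78, 3), (0.82, 8), (0.80, 12), (0.85, 25), (0.82, 10)]
front = pareto_front(candidates)
print(front)
```

Each surviving candidate represents a different compromise: no other candidate is both at least as accurate and at least as small. A full NSGA2 run would additionally rank the dominated candidates into successive fronts and use crowding distance to preserve diversity.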
The software is publicly available on GitHub at github.com/UEFBiomedicalInformaticsLab/BIODAI/tree/main/MOO.
Supplementary data are available at Bioinformatics online.