Department of Computer Science, Princeton University, Princeton, New Jersey, USA.
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA.
PLoS Comput Biol. 2020 Nov 30;16(11):e1008429. doi: 10.1371/journal.pcbi.1008429. eCollection 2020 Nov.
Aging is a complex process with poorly understood genetic mechanisms. Recent studies have sought to classify genes as pro-longevity or anti-longevity using a variety of machine learning algorithms. However, it is not clear which types of features are best for optimizing classification performance and which algorithms are best suited to this task. Further, performance assessments based on held-out test data are lacking. We systematically compare five popular classification algorithms using gene ontology and gene expression datasets as features to predict the pro-longevity versus anti-longevity status of genes for two model organisms (C. elegans and S. cerevisiae) using the GenAge database as ground truth. We find that elastic net penalized logistic regression performs particularly well at this task. Using elastic net, we make novel predictions of pro- and anti-longevity genes that are not currently in the GenAge database.
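As an illustration of the approach named above, the following is a minimal sketch of elastic net penalized logistic regression evaluated on held-out test data. It is not the authors' pipeline: the data are synthetic stand-ins rather than GenAge labels or real gene ontology/expression features, the label coding (1 = pro-longevity, 0 = anti-longevity), the penalty parameterization (lam, alpha), the proximal gradient solver, and the use of accuracy as the metric are all assumptions made for the example.

```python
import math
import random


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty (shrinks toward zero)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_elastic_net_logistic(X, y, lam=0.05, alpha=0.5, step=0.1, n_iter=500):
    """Proximal gradient descent on
    mean log-loss + lam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2).
    The intercept b is left unpenalized. All hyperparameters are illustrative."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * xi[j] / n
        b -= step * grad_b
        for j in range(p):
            # Gradient step on the smooth part (log-loss plus ridge term) ...
            z = w[j] - step * (grad_w[j] + lam * (1 - alpha) * w[j])
            # ... followed by the L1 proximal (soft-thresholding) step.
            w[j] = soft_threshold(z, step * lam * alpha)
    return w, b


def predict(X, w, b):
    """Predict 1 (assumed: pro-longevity) when the fitted probability is >= 0.5."""
    return [
        1 if sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) >= 0.5 else 0
        for xi in X
    ]


# Synthetic stand-in data: 200 hypothetical "genes", one informative feature
# followed by four pure-noise features. Not real GenAge data.
rng = random.Random(0)
X, y = [], []
for _ in range(200):
    label = rng.randint(0, 1)
    informative = rng.gauss(2.0 if label else -2.0, 1.0)
    noise = [rng.gauss(0.0, 1.0) for _ in range(4)]
    X.append([informative] + noise)
    y.append(label)

# Hold out the last 50 examples; they are never seen during fitting.
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

w, b = fit_elastic_net_logistic(X_train, y_train)
y_pred = predict(X_test, w, b)
test_accuracy = sum(int(p == t) for p, t in zip(y_pred, y_test)) / len(y_test)

print("coefficients:", [round(wj, 3) for wj in w])
print("held-out accuracy:", round(test_accuracy, 3))
```

In this sketch the L1 part of the penalty can drive coefficients of uninformative features to exactly zero, while the L2 part keeps the fit stable when features are correlated; with real data, lam and alpha would typically be chosen by cross-validation on the training portion only.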