Khaiwal Sakshi, De Chiara Matteo, Barré Benjamin P, Barrio-Hernandez Inigo, Stenberg Simon, Beltrao Pedro, Warringer Jonas, Liti Gianni
CNRS, INSERM, IRCAN, Côte d'Azur University, Nice, France.
Institute of Molecular Systems Biology, ETH Zürich, Zürich, 8093, Switzerland.
Mol Syst Biol. 2025 Sep 1. doi: 10.1038/s44320-025-00136-y.
Most organismal traits result from the complex interplay of many genetic and environmental factors, making their prediction difficult. Here, we used machine learning (ML) models to explore phenotype predictions for 223 traits measured across 1011 genome-sequenced Saccharomyces cerevisiae strains isolated worldwide. We benchmarked an ML pipeline with multiple linear and non-linear models to predict phenotypes from genotypes and gene expression, and identified gradient boosting machines as the best-performing model. Gene function disruption scores and gene presence/absence emerged as the best predictors, suggesting a considerable contribution of the accessory genome in controlling phenotypes. Prediction accuracy varied widely among phenotypes, with stress resistance being easier to predict than growth across nutrients. ML identified relevant genomic features linked to phenotypes, including high-impact variants with established relationships to phenotypes, despite these being rare in the population. Near-perfect accuracies were achieved when other phenomics data, mostly from similar conditions, were used, suggesting that useful information can be conveyed across phenotypes. Overall, our study underscores the power of ML to interpret the functional outcome of genetic variants.