Comparing statistical learning methods for complex trait prediction from gene expression.

Authors

Arango Noah Klimkowski, Morgante Fabio

Affiliations

Center for Human Genetics, Clemson University, Greenwood, SC, USA.

Department of Genetics and Biochemistry, Clemson University, Clemson, SC, USA.

Publication information

bioRxiv. 2024 Jun 3:2024.06.01.596951. doi: 10.1101/2024.06.01.596951.

DOI:10.1101/2024.06.01.596951

PMID:38895364

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11185554/

Abstract

Accurate prediction of complex traits is an important task in quantitative genetics that has become increasingly relevant for personalized medicine. Genotypes have traditionally been used for trait prediction with a variety of methods, such as mixed models, Bayesian methods, penalized regressions, dimension reduction, and machine learning methods. Recent studies have shown that gene expression levels can produce higher prediction accuracy than genotypes. However, only a few prediction methods were used in these studies. Thus, a comprehensive assessment of methods is needed to fully evaluate the potential of gene expression as a predictor of complex trait phenotypes. Here, we used data from the Drosophila Genetic Reference Panel (DGRP) to compare the ability of several existing statistical learning methods to predict starvation resistance from gene expression in the two sexes separately. The methods considered differ in their assumptions about the distribution of gene effect sizes, ranging from models that assume that every gene affects the trait to sparser models, and in their ability to capture gene-gene interactions. We also used functional annotation (Gene Ontology (GO)) as an external source of biological information to inform prediction models. The results show that differences in prediction accuracy between methods exist, although they are generally not large. Methods performing variable selection gave higher accuracy in females, while methods assuming a more polygenic architecture performed better in males. Incorporating GO annotations further improved prediction accuracy for a few GO terms of biological significance. Biological significance extended to the genes underlying highly predictive GO terms, with different genes emerging between sexes. Notably, the Insulin-like Receptor was prevalent across methods and sexes. Our results confirmed the potential of transcriptomic prediction and highlighted the importance of selecting appropriate methods and strategies in order to achieve accurate predictions.
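
The contrast drawn in the abstract between models that keep every gene and sparser models that perform variable selection can be illustrated with a small sketch. This is not the authors' pipeline and uses no DGRP data: it simulates a toy expression matrix and fits a ridge-type and a lasso-type penalized regression by coordinate descent, then compares held-out prediction accuracy (Pearson correlation). The function names, sample sizes, penalty value `lam`, and the simulated sparse architecture are all assumptions chosen for illustration.

```python
import random


def soft_threshold(z, t):
    """Shrink z toward zero by t (the lasso proximal operator)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def fit_elastic_net(X, y, lam, alpha, n_iter=100):
    """Coordinate descent for
    (1/2n)*||y - Xb||^2 + lam*(alpha*||b||_1 + (1 - alpha)/2*||b||^2).

    alpha=1 is the lasso (sparse); alpha=0 is ridge (every gene kept).
    No intercept is fitted, so X and y are assumed to be centered.
    """
    n, p = len(X), len(X[0])
    b = [0.0] * p
    resid = list(y)  # residuals y - Xb, starting from b = 0
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            denom = col_ss[j] + lam * (1 - alpha)
            if denom == 0.0:
                continue  # constant-zero column with no ridge penalty
            # Covariance of gene j with the partial residual
            # (gene j's current contribution added back).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_ss[j] * b[j]
            new_bj = soft_threshold(rho, lam * alpha) / denom
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                b[j] = new_bj
    return b


def predict(X, b):
    return [sum(x * bj for x, bj in zip(row, b)) for row in X]


def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / (va * vb) ** 0.5


def simulate(n, p, n_causal, rng):
    """Toy data: p standardized 'genes', the first n_causal affect the trait."""
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    effects = [1.0 if j < n_causal else 0.0 for j in range(p)]
    y = [sum(x * e for x, e in zip(row, effects)) + rng.gauss(0, 1) for row in X]
    return X, y


rng = random.Random(1)  # fixed seed so the sketch is reproducible
X, y = simulate(n=120, p=30, n_causal=3, rng=rng)
X_train, y_train = X[:80], y[:80]
X_test, y_test = X[80:], y[80:]

for name, alpha in [("ridge", 0.0), ("lasso", 1.0)]:
    b = fit_elastic_net(X_train, y_train, lam=0.1, alpha=alpha)
    r = pearson(predict(X_test, b), y_test)
    n_nonzero = sum(1 for bj in b if bj != 0.0)
    print(f"{name}: test correlation = {r:.3f}, non-zero effects = {n_nonzero}")
```

Changing `n_causal` from a handful of genes to all `p` genes moves the simulated trait from a sparse to a polygenic architecture, which is one way to see why no single penalty is expected to perform best in every setting. In practice the penalty would be tuned by cross-validation rather than fixed.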
