From genotype to phenotype in Arabidopsis thaliana: in-silico genome interpretation predicts 288 phenotypes from sequencing data.

Affiliations

ESAT-STADIUS, KU Leuven, 3001 Leuven, Belgium.

Institut Jean-Pierre Bourgin, Université Paris-Saclay, INRAE, AgroParisTech, 78000 Versailles, France.

Publication information

Nucleic Acids Res. 2022 Feb 22;50(3):e16. doi: 10.1093/nar/gkab1099.

DOI: 10.1093/nar/gkab1099

PMID: 34792168

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8860592/

Abstract

In many cases, the unprecedented volume of data produced by high-throughput sequencing has shifted the bottleneck from data availability to data interpretation. This shift has delayed the promised breakthroughs in genetics and precision medicine in human genetics and, in plant science, in phenotype prediction aimed at improving plant adaptation to climate change and resistance to bioaggressors. In this paper, we propose a novel Genome Interpretation paradigm that directly models the genotype-to-phenotype relationship, and we focus on Arabidopsis thaliana because it is the best-studied model organism in plant genetics. Our model, called Galiana, is the first end-to-end Neural Network (NN) approach following the genomes-in/phenotypes-out paradigm, and it is trained to predict 288 real-valued A. thaliana phenotypes from whole-genome sequencing data. We show that 75 of these phenotypes, mostly related to flowering traits, are predicted with a Pearson correlation ≥0.4. We also show that our end-to-end NN approach achieves better performance and larger phenotype coverage than models that predict single phenotypes from known associated genes derived from GWAS. Galiana is also fully interpretable, thanks to gradient-based Saliency Maps. We used this interpretation approach to identify 36 novel genes that are likely to be associated with flowering traits, and found supporting evidence for 6 of them in the existing literature.
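
The evaluation criterion quoted above (a phenotype counts as well predicted when the Pearson correlation between observed and predicted values is at least 0.4) can be illustrated with a minimal sketch. This is not the authors' code: the function names `pearson` and `well_predicted`, the phenotype names, and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return cov / (sx * sy)


def well_predicted(observed, predicted, threshold=0.4):
    """Return the phenotypes whose Pearson r reaches the threshold.

    observed / predicted: dict mapping a phenotype name to a list of
    values, one per accession, in the same accession order.
    """
    hits = []
    for name, obs in observed.items():
        r = pearson(obs, predicted[name])
        if r >= threshold:  # NaN compares False, so undefined r is skipped
            hits.append(name)
    return hits


# Toy, made-up values (NOT data from the paper).
observed = {
    "flowering_time": [10.0, 12.0, 15.0, 20.0],
    "leaf_number": [5.0, 7.0, 6.0, 8.0],
}
predicted = {
    "flowering_time": [11.0, 12.5, 14.0, 19.0],
    "leaf_number": [8.0, 5.0, 7.0, 6.0],
}

print(well_predicted(observed, predicted))  # prints ['flowering_time']
```

In this toy example only the first phenotype passes the threshold. Applying the same count to the paper's held-out predictions would presumably give the reported 75 of 288, although the exact evaluation protocol is not described in the abstract.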

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fce8/8860592/1312e60f5a47/gkab1099fig1.jpg
