Department of Plant Biology, Carnegie Institution for Science, Stanford, California 94305.
G3 (Bethesda). 2019 Oct 7;9(10):3129-3138. doi: 10.1534/g3.119.400319.
Linkage mapping is one of the most commonly used methods to identify genetic loci that determine a trait. However, the loci identified by linkage mapping may contain hundreds of candidate genes and require a time-consuming and labor-intensive fine-mapping process to find the causal gene controlling the trait. With the availability of a rich assortment of genomic and functional genomic data, it is possible to develop a computational method that speeds up the identification of causal genes. We developed QTG-Finder, a machine-learning-based algorithm that prioritizes causal genes by ranking the genes within a quantitative trait locus (QTL). Two predictive models were trained separately on known causal genes in Arabidopsis and rice. An independent validation analysis showed that the models could recall about 64% of Arabidopsis and 79% of rice causal genes when the top 20% of ranked genes were considered. The top 20% of ranked genes can range from 10 to 100 genes, depending on the size of the QTL. The models can prioritize causal genes for different types of traits, though with differing efficiency. We also identified several important features of causal genes, including paralog copy number, being a transporter, being a transcription factor, and containing SNPs that cause premature stop codons. This work lays the foundation for systematically understanding the characteristics of causal genes and establishes a pipeline to predict causal genes based on public data.
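To make the "top 20%" recall figure concrete, the following is a minimal sketch of how such a metric could be computed once every gene in a QTL has a score. It is not the QTG-Finder implementation: the function names, the gene identifiers, and the scores are all hypothetical, and the rounding rule for the cutoff (round up, at least one gene) is an assumption. Under that rule, a 50-gene QTL yields 10 top-ranked genes and a 500-gene QTL yields 100, which is in line with the range quoted in the abstract.

```python
def top_k(n_genes, percent=20):
    """Number of genes in the top `percent` of a QTL (assumed: round up, minimum 1)."""
    return max(1, (n_genes * percent + 99) // 100)


def rank_genes(scores):
    """Return gene IDs ordered from highest to lowest score."""
    return [gene for gene, _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)]


def recall_at_top_percent(qtls, percent=20):
    """Fraction of QTLs whose known causal gene falls within the top `percent` of ranked genes.

    `qtls` is a list of (scores, causal_gene) pairs, where `scores` maps
    gene ID -> model score for every gene in that QTL.
    """
    hits = 0
    for scores, causal_gene in qtls:
        ranked = rank_genes(scores)
        if causal_gene in ranked[:top_k(len(ranked), percent)]:
            hits += 1
    return hits / len(qtls)


# Hypothetical toy data: two 5-gene QTLs with made-up scores.
toy_scores = {"g1": 0.9, "g2": 0.1, "g3": 0.5, "g4": 0.3, "g5": 0.2}
toy_qtls = [
    (toy_scores, "g1"),  # causal gene is ranked first, so it is recalled
    (toy_scores, "g3"),  # causal gene is ranked second, outside the top 20% (1 gene)
]
toy_recall = recall_at_top_percent(toy_qtls)
```

In this toy case one of the two causal genes is recovered, so `toy_recall` is 0.5. A real evaluation would use held-out QTLs with experimentally confirmed causal genes, as the independent validation in the abstract does.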