Hori Tomoaki, Montcho David, Agbangla Clement, Ebana Kaworu, Futakuchi Koichi, Iwata Hiroyoshi
Department of Agricultural and Environmental Biology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo, 113-8657, Japan.
Africa Rice Center, 01 B.P. 2031, Cotonou, Benin.
Theor Appl Genet. 2016 Nov;129(11):2101-2115. doi: 10.1007/s00122-016-2760-9. Epub 2016 Aug 19.
A method based on a multi-task Gaussian process using self-measuring similarity gave increased accuracy for imputing missing phenotypic data in multi-trait and multi-environment trials. Multi-environmental trial (MET) data often contain missing values. Accurate imputation of missing data makes subsequent analysis more effective and the results easier to understand. Moreover, accurate imputation may help to reduce the cost of phenotyping for thinned-out lines tested in METs. METs are generally performed for multiple traits that are correlated with each other. Correlation among traits can be useful information for imputation, but single-trait-based methods cannot use the information shared by correlated traits. In this paper, we propose imputation methods based on a multi-task Gaussian process (MTGP) using self-measuring similarity kernels that reflect relationships among traits, genotypes, and environments. This framework allows us to use genetic correlation among multi-trait multi-environment data and also to combine MET data with marker genotype data. We compared the accuracy of three MTGP methods and iterative regularized PCA using rice MET data. Two scenarios for the generation of missing data were considered, each at various missing rates. The MTGP methods achieved higher imputation accuracy than regularized PCA, especially at high missing rates. Under the 'uniform' scenario, in which missing data arise randomly, including marker genotype data in the imputation increased the imputation accuracy at high missing rates. Under the 'fiber' scenario, in which data are missing for all traits in some combinations of genotype and environment, including marker genotype data decreased the imputation accuracy for most traits while increasing it remarkably for a few traits. The proposed methods will be useful for solving the missing data problem in MET data.
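The two missing-data scenarios described in the abstract can be illustrated with a minimal sketch. It assumes the MET data are stored as a three-way array ordered genotype x environment x trait; the array sizes, the function names `uniform_mask` and `fiber_mask`, and the sampling details are hypothetical and are not taken from the paper, which may have generated its missing patterns differently.

```python
import numpy as np

def uniform_mask(shape, rate, rng):
    """'Uniform' scenario: individual cells go missing at random."""
    n_total = int(np.prod(shape))
    n_missing = int(round(rate * n_total))
    mask = np.zeros(n_total, dtype=bool)
    # Pick distinct cells anywhere in the array.
    mask[rng.choice(n_total, size=n_missing, replace=False)] = True
    return mask.reshape(shape)

def fiber_mask(shape, rate, rng):
    """'Fiber' scenario: for a chosen (genotype, environment) pair,
    every trait is missing at once."""
    n_geno, n_env, n_trait = shape
    n_pairs = n_geno * n_env
    n_missing = int(round(rate * n_pairs))
    pair_mask = np.zeros(n_pairs, dtype=bool)
    # Pick distinct genotype-environment pairs.
    pair_mask[rng.choice(n_pairs, size=n_missing, replace=False)] = True
    # Copy the pair-level mask along the trait axis.
    return np.repeat(pair_mask.reshape(n_geno, n_env)[:, :, None], n_trait, axis=2)

# Assumed toy dimensions: 20 genotypes, 4 environments, 5 traits.
rng = np.random.default_rng(0)
shape = (20, 4, 5)
u = uniform_mask(shape, 0.3, rng)
f = fiber_mask(shape, 0.3, rng)

# Fraction of (genotype, environment) pairs with no observed trait at all.
print("uniform:", u.all(axis=2).mean())
print("fiber:  ", f.all(axis=2).mean())
```

In this sketch both masks hide the same share of cells, but under the fiber mask a hidden genotype-environment pair has no observed trait left, so correlation among traits within that pair cannot help and information has to come from other environments, other genotypes, or marker data.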