Abel Haley J, Thomas Alun
University of Utah, Utah, USA.
Stat Appl Genet Mol Biol. 2011;10(1):Article 5. doi: 10.2202/1544-6115.1615. Epub 2011 Jan 6.
We develop recent work on using graphical models for linkage disequilibrium to provide efficient programs for model fitting, phasing, and imputation of missing data in large data sets. Two important features contribute to the computational efficiency: the separation of the model fitting and phasing-imputation processes into different programs, and holding in memory only the data within a moving window of loci during model fitting. Optimal parameter values were chosen by cross-validation to maximize the probability of correctly imputing masked genotypes. The best accuracy obtained is slightly below that of the Beagle program of Browning and Browning, and our fitting program is slower. However, for large data sets, it uses less storage. For a reference set of n individuals genotyped at m markers, the time and storage required for fitting a graphical model are approximately O(nm) and O(n+m), respectively. Imputing the phases and missing data for n individuals using an already fitted graphical model requires O(nm) time and O(m) storage. While the times for fitting and imputation are both O(nm), the imputation process is considerably faster; thus, once a model is estimated from a reference data set, the marginal cost of phasing and imputing further samples is very low.
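The moving-window idea can be illustrated with a minimal sketch. This is not the authors' program: the function names `moving_windows` and `stream_loci`, the window width, and the toy genotype coding are all assumptions made for illustration. It shows only that when loci are streamed one at a time and a fixed number of them are retained, the genotype data held in memory is n times the window width rather than n times m.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Tuple

Column = List[int]  # genotypes of all n individuals at one locus


def moving_windows(
    loci: Iterable[Column], width: int
) -> Iterator[Tuple[int, List[Column]]]:
    """Yield (start_index, window) pairs, holding at most `width` loci at once."""
    if width < 1:
        raise ValueError("width must be at least 1")
    window: Deque[Column] = deque(maxlen=width)
    for i, column in enumerate(loci):
        window.append(column)  # the oldest locus is dropped automatically
        if len(window) == width:
            yield i - width + 1, list(window)


def stream_loci(n: int, m: int) -> Iterator[Column]:
    """Hypothetical data source: emits one locus at a time, never an n-by-m matrix."""
    for j in range(m):
        yield [(i + j) % 3 for i in range(n)]  # toy genotype codes 0, 1, 2


starts = []
peak = 0
for start, window in moving_windows(stream_loci(n=4, m=10), width=3):
    # A model-fitting step restricted to these loci would go here (assumed).
    starts.append(start)
    peak = max(peak, len(window))
```

In this sketch a fitting step would see each run of three consecutive loci in turn, while the full set of ten loci is never stored together.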