Devlin B, Risch N, Roeder K
Department of Epidemiology and Public Health, Yale University, New Haven, CT 06510.
Am J Hum Genet. 1991 Apr;48(4):662-76.
VNTR (variable number of tandem repeats) loci provide valuable information for a number of fields of study involving human genetics, ranging from forensics (DNA fingerprinting and paternity testing) to linkage analysis and population genetics. Alleles of a VNTR locus are fragments obtained from a particular portion of the DNA molecule and are defined in terms of their length. The essential element of a VNTR fragment is the repeat, a short sequence of base pairs. The core of the fragment is composed of a variable number of identical repeats linked in tandem. A sample of fragments from a population of individuals exhibits substantial variation in length because the number of repeats varies. Each distinct fragment length defines an allele, but any given fragment is measured with error. The observed distribution of fragment lengths is therefore continuous rather than discrete, and determining distinct allele classes is not straightforward. A mixture model is the natural statistical method for estimating the allele frequencies of VNTR loci. In this article we develop nonparametric methods for obtaining the distribution of allele sizes and estimates of their frequencies. We also develop methods for obtaining maximum-likelihood estimates. In addition, we suggest an empirical Bayes method to improve the maximum-likelihood estimates of the gene frequencies; the empirical Bayes procedure effects a local smoothing. The latter method works particularly well when measurement error is large relative to the repeat size; in that case the maximum-likelihood estimate of the allele-frequency distribution is unreliable because it shows an alternating pattern of over- and underestimation. We define alleles and estimate the allele frequencies for two VNTR loci from the human genome (D17S79 and D2S44), using data obtained from Lifecodes, Inc.
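The mixture-model idea in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure: it assumes Gaussian measurement error with a known standard deviation `sd`, a fixed list of candidate allele sizes, and simulated data, none of which come from the article. The function name `em_allele_frequencies` and all numeric values are hypothetical. Under those assumptions, the allele frequencies are the mixing proportions of the mixture, and an EM iteration gives their maximum-likelihood estimates.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution (assumed measurement-error model)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def em_allele_frequencies(lengths, allele_sizes, sd, n_iter=200):
    """EM for mixing proportions only; component means and sd stay fixed."""
    k = len(allele_sizes)
    freqs = [1.0 / k] * k  # start from uniform frequencies
    # Means and sd never change, so the component densities are computed once.
    dens = [[normal_pdf(x, m, sd) for m in allele_sizes] for x in lengths]
    for _ in range(n_iter):
        counts = [0.0] * k
        for row in dens:
            weighted = [f * d for f, d in zip(freqs, row)]
            total = sum(weighted)
            if total == 0.0:
                continue  # observation too far from every candidate size
            # E-step: posterior probability that this fragment is allele j
            for j in range(k):
                counts[j] += weighted[j] / total
        n_used = sum(counts)
        if n_used == 0.0:
            raise ValueError("no observation is near any candidate allele size")
        # M-step: new frequencies are the average posterior memberships
        freqs = [c / n_used for c in counts]
    return freqs


# Simulated example (hypothetical sizes in base pairs, not data from the article)
random.seed(0)
true_sizes = [1000.0, 1030.0, 1060.0]
true_freqs = [0.5, 0.3, 0.2]
sd = 5.0
sample = [
    random.gauss(random.choices(true_sizes, true_freqs)[0], sd)
    for _ in range(2000)
]
estimates = em_allele_frequencies(sample, true_sizes, sd)
print([round(f, 3) for f in estimates])
```

In this simulation the allele sizes are well separated relative to `sd`, so the estimates should land close to the simulated frequencies. The harder case described in the abstract, where measurement error is large relative to the repeat size, is the one in which plain maximum-likelihood estimates become unreliable and the empirical Bayes smoothing is suggested; that step is not sketched here because the abstract does not specify it.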