Swiss Institute of Bioinformatics, University of Lausanne, Lausanne, Switzerland.
Stat Med. 2013 Sep 20;32(21):3737-51. doi: 10.1002/sim.5792. Epub 2013 Apr 23.
We present a Bayesian approach for estimating the relative frequencies of multi-SNP (single nucleotide polymorphism) haplotypes in populations of the malaria parasite Plasmodium falciparum by using microarray SNP data from human blood samples. Each sample comes from a malaria patient and contains one or several parasite clones that may differ genetically. Samples containing multiple parasite clones with different genetic markers pose a special challenge; the situation is comparable to that of a polyploid organism. The data from each blood sample indicate whether the parasites in the blood carry a mutant or a wildtype allele at various selected genomic positions. If both mutant and wildtype alleles are detected at a given position in a multiply infected sample, the data indicate the presence of both alleles, but their ratio is unknown. Thus, the data only partially reveal which specific combinations of genetic markers (i.e., haplotypes across the examined SNPs) occur in distinct parasite clones. In addition, SNP data may contain errors at non-negligible rates. We use a multinomial mixture model with partially missing observations to represent these data and a Markov chain Monte Carlo method to estimate the haplotype frequencies in a population. Our approach addresses both challenges: multiple infections and data errors.
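The ambiguity described above can be made concrete with a small sketch. It is not the authors' implementation: it assumes a known number of clones per sample, clones drawn independently from the population haplotype frequencies, and error-free SNP calls, and it leaves out the Markov chain Monte Carlo step entirely. All names (`observed_calls`, `sample_likelihood`, the `WILD`/`MUTANT`/`MIXED` codes) and the example frequencies are hypothetical, chosen only for illustration.

```python
from itertools import product

# Hypothetical per-SNP call codes for one blood sample.
WILD, MUTANT, MIXED = 0, 1, 2


def observed_calls(clones):
    """Collapse the haplotypes of all clones in a sample into per-SNP calls.

    Each haplotype is a tuple of 0 (wildtype) / 1 (mutant) alleles.
    A SNP is MIXED when the clones disagree at that position.
    """
    calls = []
    for alleles in zip(*clones):
        if all(a == 0 for a in alleles):
            calls.append(WILD)
        elif all(a == 1 for a in alleles):
            calls.append(MUTANT)
        else:
            calls.append(MIXED)
    return tuple(calls)


def sample_likelihood(calls, freqs, n_clones):
    """P(calls | freqs, n_clones) by brute-force enumeration.

    Assumptions (not from the paper): clones are independent draws from
    `freqs`, the number of clones is known, and calls contain no errors.
    """
    total = 0.0
    for clones in product(freqs, repeat=n_clones):
        if observed_calls(clones) == calls:
            p = 1.0
            for hap in clones:
                p *= freqs[hap]
            total += p
    return total


# Illustrative (made-up) frequencies for the four haplotypes over two SNPs.
freqs = {(0, 0): 0.4, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.1}

# Both SNPs mixed in a two-clone sample: the clones could be
# {(0,0), (1,1)} or {(0,1), (1,0)}, so the haplotypes are only partly revealed.
p_mixed = sample_likelihood((MIXED, MIXED), freqs, n_clones=2)
```

With these made-up frequencies, both haplotype pairs contribute to `p_mixed`, which is why the call pattern alone cannot tell them apart. A sampler over the frequencies would evaluate a likelihood of this kind, extended with an error model, for every sample at each iteration.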