Bioinformatics Interdepartmental Program, University of California, Los Angeles, USA.
Mol Biol Evol. 2013 May;30(5):1145-58. doi: 10.1093/molbev/mst016. Epub 2013 Jan 30.
DNA samples are often pooled, either by experimental design or because the sample itself is a mixture. For example, when population allele frequencies are of primary interest, individual samples may be pooled together to lower the cost of sequencing. Alternatively, the sample itself may be a mixture of multiple species or strains (e.g., bacterial species comprising a microbiome or pathogen strains in a blood sample). We present an expectation-maximization algorithm for estimating haplotype frequencies in a pooled sample directly from mapped sequence reads, in the case where the possible haplotypes are known. This method is relevant to the analysis of pooled sequencing data from selection experiments, as well as the calculation of proportions of different species within a metagenomics sample. Our method outperforms existing methods based on single-site allele frequencies, as well as simple approaches using sequence read data. We have implemented the method in a freely available open-source software tool.
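The abstract does not give the details of the algorithm, so the following is only a minimal sketch of a generic expectation-maximization scheme for mixture proportions over known haplotypes, not the authors' implementation. It assumes ungapped reads with a known start position and a uniform per-base error rate; the function names `read_likelihoods` and `em_haplotype_frequencies`, the `err` parameter, and the toy data are all hypothetical.

```python
import numpy as np

def read_likelihoods(reads, haplotypes, err=0.01):
    """Return lik[i, j] = P(read i | haplotype j).

    Assumption (not from the paper): each read is a (start, sequence) pair
    aligned without gaps, and each base is miscalled independently with
    probability `err`, spread evenly over the three other bases.
    """
    lik = np.ones((len(reads), len(haplotypes)))
    for i, (start, seq) in enumerate(reads):
        for j, hap in enumerate(haplotypes):
            for k, base in enumerate(seq):
                match = hap[start + k] == base
                lik[i, j] *= (1.0 - err) if match else err / 3.0
    return lik

def em_haplotype_frequencies(lik, n_iter=200, tol=1e-8):
    """Estimate haplotype frequencies by EM from a read-by-haplotype likelihood matrix."""
    lik = np.asarray(lik, dtype=float)
    n_reads, n_haps = lik.shape
    freqs = np.full(n_haps, 1.0 / n_haps)  # start from uniform frequencies
    for _ in range(n_iter):
        # E-step: posterior probability that each read came from each haplotype
        weighted = lik * freqs
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: new frequencies are the average responsibilities
        new_freqs = resp.sum(axis=0) / n_reads
        converged = np.abs(new_freqs - freqs).max() < tol
        freqs = new_freqs
        if converged:
            break
    return freqs

# Toy data: two known haplotypes differing at one site (index 4),
# with three reads matching the first and one matching the second.
haplotypes = ["ACGTACGT", "ACGTTCGT"]
reads = [(2, "GTAC"), (2, "GTAC"), (2, "GTAC"), (2, "GTTC")]
freqs = em_haplotype_frequencies(read_likelihoods(reads, haplotypes))
print(freqs)
```

In this toy case the estimate should land close to a 3:1 split. Reads that do not cover a distinguishing site have equal likelihood under every haplotype, so in this simplified model they simply take on the current frequency estimates and contribute no information.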