Bunge John, Böhning Dankmar, Allen Heather, Foster James A
Department of Statistical Science, Cornell University, Ithaca, NY 14853, USA.
Pac Symp Biocomput. 2012:203-12.
We consider the classical population diversity estimation scenario based on frequency count data (the number of classes or taxa represented once, twice, etc. in the sample), but with the proviso that the lowest frequency counts, especially the singletons, may not be reliably observed. This problem arises especially in data derived from modern high-throughput DNA sequencing, where errors may cause sequences to be incorrectly assigned to new taxa instead of being matched to existing, observed taxa. We look at a spectrum of methods for addressing this issue, focusing in particular on fitting a parametric mixture model and deleting the highest-diversity component; we also consider regarding the data as left-censored, which effectively pools two or more of the lowest frequency counts. We find that these purely statistical "downstream" corrections depend strongly on their underlying assumptions, but that such methods can nonetheless be useful.
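To make the data structure concrete, the following is a minimal Python sketch of frequency count data and of the left-censoring idea, in which the lowest counts are pooled into a single total. The function names (`frequency_counts`, `pool_low_counts`), the `cutoff` parameter, and the toy abundances are illustrative assumptions, not taken from the paper, and the sketch does not implement the mixture-model fitting or any diversity estimator.

```python
from collections import Counter


def frequency_counts(abundances):
    """Return {j: f_j}, where f_j is the number of taxa seen exactly j times."""
    counts = Counter(a for a in abundances if a > 0)
    return dict(sorted(counts.items()))


def pool_low_counts(freq, cutoff):
    """Treat counts f_1..f_cutoff as left-censored (hypothetical helper).

    Only their total is kept, since the individual low counts are
    assumed to be unreliable. Returns (pooled_total, remaining_counts).
    """
    pooled = sum(f for j, f in freq.items() if j <= cutoff)
    remaining = {j: f for j, f in freq.items() if j > cutoff}
    return pooled, remaining


# Toy per-taxon abundances (made up for illustration only).
abundances = [1, 1, 1, 1, 2, 2, 3, 5, 5, 12]

freq = frequency_counts(abundances)
pooled, remaining = pool_low_counts(freq, cutoff=2)

print(freq)       # singletons f_1 = 4, doubletons f_2 = 2, ...
print(pooled)     # total number of taxa with frequency <= 2
print(remaining)  # counts for frequencies above the cutoff
```

Pooling keeps the number of observed taxa unchanged while discarding how they split between singletons and doubletons, which is the information the abstract describes as potentially unreliable.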