Lim Haw Chuan, Braun Michael J
Department of Vertebrate Zoology, National Museum of Natural History, Smithsonian Institution, Washington, DC, 20560, USA.
Mol Ecol Resour. 2016 Sep;16(5):1204-23. doi: 10.1111/1755-0998.12568. Epub 2016 Aug 8.
Sample availability limits population genetics research on many species, especially taxa from regions with high diversity. However, many such species are well represented in museum collections assembled before the molecular era. Development of techniques to recover genetic data from these invaluable specimens will benefit biodiversity science. Using a mixture of freshly preserved and historical tissue samples, and a sequence capture probe set targeting >5000 loci, we produced high-confidence genotype calls on thousands of single nucleotide polymorphisms (SNPs) in each of five South-East Asian bird species and their close relatives (N = 27-43). On average, 66.2% of the reads mapped to the pseudo-reference genome of each species. Of these mapped reads, an average of 52.7% was identified as PCR or optical duplicates. We achieved deeper effective sequencing for historical samples (122.7×) compared to modern samples (23.5×). The number of nucleotide sites with at least 8× sequencing depth was high, with averages ranging from 0.89 × 10⁶ bp (Arachnothera, modern samples) to 1.98 × 10⁶ bp (Stachyris, modern samples). Linear regression revealed that the amount of sequence data obtained from each historical sample (represented by per cent of the pseudo-reference genome recovered with ≥8× sequencing depth) was positively and significantly (P ≤ 0.013) related to how recently the sample was collected. We observed characteristic post-mortem damage in the DNA of historical samples. However, we were able to reduce the error rate significantly by truncating ends of reads during read mapping (local alignment) and conducting stringent SNP and genotype filtering.
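The regression described in the abstract relates, for each historical sample, the per cent of the pseudo-reference genome recovered at ≥8× depth to the collection date. The following is a minimal sketch of that calculation, not the authors' pipeline: the function names, the per-site depth lists, and the collection years are hypothetical placeholders invented for illustration, and the sketch fits an ordinary least-squares line without computing the P value reported in the abstract.

```python
MIN_DEPTH = 8  # depth threshold named in the abstract (>= 8x)


def percent_recovered(depths, min_depth=MIN_DEPTH):
    """Per cent of reference sites whose sequencing depth is >= min_depth."""
    if not depths:
        return 0.0
    covered = sum(1 for d in depths if d >= min_depth)
    return 100.0 * covered / len(depths)


def ols_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two or more paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Hypothetical data: collection year -> per-site depths for one sample.
    # These values are made up and are NOT taken from the study.
    samples = {
        1905: [0, 2, 9, 1, 8, 3, 0, 4],
        1932: [5, 9, 12, 8, 2, 7, 10, 1],
        1958: [9, 14, 8, 20, 6, 11, 9, 3],
    }
    years = sorted(samples)
    recovered = [percent_recovered(samples[y]) for y in years]
    slope, intercept = ols_fit(years, recovered)
    # A positive slope means more recently collected samples recovered more sites.
    print(f"slope = {slope:.3f} percentage points per year")
```

In practice the per-site depths would come from the mapped, duplicate-filtered reads of each sample, and a statistics package would supply the significance test for the slope.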