Gardner Shea N, Lam Marisa W, Smith Jason R, Torres Clinton L, Slezak Tom R
Pathogen Bio-Informatics, Lawrence Livermore National Laboratory, PO Box 808, L-174, Livermore, CA 94551, USA.
Nucleic Acids Res. 2005 Oct 20;33(18):5838-50. doi: 10.1093/nar/gki896. Print 2005.
Sequencing pathogen genomes is costly, demanding careful allocation of limited sequencing resources. We built a computational Sequencing Analysis Pipeline (SAP) to guide decisions about how much genomic sequencing is necessary to develop high-quality diagnostic DNA and protein signatures. SAP uses simulations to estimate the number of target genomes and close phylogenetic relatives (near neighbors, or NNs) to sequence. Using Marburg and variola virus sequences, we apply SAP to assess whether draft data are sufficient or finished sequencing is required. Simulations indicate that intermediate- to high-quality draft of target organisms, with error rates of 10^-3 to 10^-5 (approximately 8x coverage), is suitable for DNA signature prediction. Low-quality draft of target isolates, with error rates of approximately 1% (3x to 6x coverage), is inadequate for DNA signature prediction, although low-quality draft of NNs is sufficient as long as the target genomes are of high quality. For protein signature prediction, sequencing errors in target genomes substantially reduce the detection of amino acid sequence conservation, even if the draft is of high quality. In summary, high-quality draft of targets combined with low-quality draft of NNs appears to be a cost-effective investment for DNA signature prediction, but may lead to underestimation of predicted protein signatures.
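The effect of draft error rate on sequence conservation among target genomes can be illustrated with a toy simulation. The sketch below is not SAP and does not reproduce the authors' method. It assumes, purely for illustration, that errors are independent single-base substitutions, that all target isolates are otherwise identical to one random reference, and that a candidate signature region is an 18-base window. The function names, the window length k=18, and the genome length are all hypothetical choices.

```python
import random

BASES = "ACGT"


def add_sequencing_errors(seq, error_rate, rng):
    """Return a copy of seq with independent substitution errors (assumed error model)."""
    out = []
    for base in seq:
        if rng.random() < error_rate:
            # Substitute with a different base chosen uniformly at random.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def conserved_fraction(reference, drafts, k):
    """Fraction of the reference's distinct k-mers found intact in every draft."""
    ref = kmers(reference, k)
    shared = set(ref)
    for draft in drafts:
        shared &= kmers(draft, k)
    return len(shared) / len(ref)


def error_free_probability(error_rate, k):
    """Probability that a k-base window contains no error, under independent errors."""
    return (1.0 - error_rate) ** k


def simulate(error_rate, n_targets=5, length=5000, k=18, seed=0):
    """Simulate n_targets error-containing drafts of one random reference genome."""
    rng = random.Random(seed)
    reference = "".join(rng.choice(BASES) for _ in range(length))
    drafts = [add_sequencing_errors(reference, error_rate, rng)
              for _ in range(n_targets)]
    return conserved_fraction(reference, drafts, k)


if __name__ == "__main__":
    for rate in (1e-2, 1e-3, 1e-5):
        simulated = simulate(rate)
        # Analytic expectation: a window must be error-free in all 5 drafts.
        expected = error_free_probability(rate, 18) ** 5
        print(f"error rate {rate:g}: simulated {simulated:.3f}, expected {expected:.3f}")
```

Under these assumptions, a window survives only if it is error-free in every draft, so the conserved fraction falls roughly as (1 - e)^(k * n) for error rate e, window length k, and n target genomes. This is consistent in direction with the abstract's finding that error rates near 1% in target isolates erode apparent conservation, while rates of 10^-3 or lower leave most windows intact.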