Haslam Niall J, Whiteford Nava E, Weber Gerald, Prügel-Bennett Adam, Essex Jonathan W, Neylon Cameron
School of Chemistry, University of Southampton, Southampton, United Kingdom.
PLoS One. 2008 Jun 18;3(6):e2500. doi: 10.1371/journal.pone.0002500.
Sequencing by hybridisation (SBH) is an effective method for obtaining large amounts of DNA sequence information at low cost. The efficiency of SBH depends on designing the probe library to provide the maximum information for the minimum cost. Long probes provide a higher probability of non-repeated sequences but increase the number of probes required, whereas short probes may not provide unique sequence information because of repeated sequences. We have investigated the effect of probe length, use of reference sequences, and thermal filtering on the design of probe libraries for several highly variable target DNA sequences.
We designed overlapping probe libraries for a range of highly variable drug target genes based on known sequence information and developed a formal terminology to describe probe library design. We find that for some targets these libraries can provide good coverage of a previously unseen target, whereas for others the coverage is less than 30%. The optimal probe length varies from as short as 12 nt to as long as 19 nt and depends on the sequence, its variability, and the stringency of thermal filtering. It cannot be determined from inspection of an example gene sequence.
Optimal probe length and the optimal number of reference sequences used to design a probe library are highly target specific for highly variable sequencing targets. The optimum design cannot be determined simply by inspection of input sequences or of alignments, but only by detailed analysis of each specific target. For highly variable sequences, shorter probes can in some cases provide better information than longer probes. Probe library design would benefit from a general-purpose tool for analysing these issues. The formal terminology developed here, and the analysis approaches it is used to describe, will contribute to the development of such tools.
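As an illustration of the kind of analysis described above, the following minimal sketch builds an overlapping probe library from reference sequences and estimates how much of a previously unseen target it covers at several probe lengths. It is not the authors' implementation: the function names `build_probe_library` and `coverage`, the 1 nt step between probes, the definition of coverage as the fraction of target positions inside at least one matching probe, and the toy sequences are all assumptions made for illustration. Thermal filtering is not modelled.

```python
from typing import Iterable, Set


def build_probe_library(references: Iterable[str], k: int) -> Set[str]:
    """Collect every overlapping k-mer (1 nt step) found in the reference sequences.

    Hypothetical helper: the 1 nt step is an assumption, not taken from the paper.
    """
    library: Set[str] = set()
    for seq in references:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            library.add(seq[i:i + k])
    return library


def coverage(target: str, library: Set[str], k: int) -> float:
    """Fraction of target positions lying inside at least one k-mer present in the library.

    Hypothetical definition of coverage, assumed here for illustration.
    """
    target = target.upper()
    if len(target) < k:
        return 0.0
    covered = [False] * len(target)
    for i in range(len(target) - k + 1):
        if target[i:i + k] in library:
            for j in range(i, i + k):
                covered[j] = True
    return sum(covered) / len(target)


# Toy data (invented): two known variants and one unseen variant of a short region.
references = ["ACGTACGTGA", "ACGTTCGTGA"]
unseen = "ACGTACGTGC"

for k in (3, 4, 5):
    lib = build_probe_library(references, k)
    print(f"k={k}: {len(lib)} probes, coverage={coverage(unseen, lib, k):.2f}")
```

Repeating such a sweep over probe lengths and over different subsets of reference sequences is one way to see why the optimum has to be found per target: library size and coverage both change with probe length, and neither can be read off a single example sequence.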