White O, Dunning T, Sutton G, Adams M, Venter J C, Fields C
Institute for Genomic Research, Gaithersburg, MD 20878.
Nucleic Acids Res. 1993 Aug 11;21(16):3829-38. doi: 10.1093/nar/21.16.3829.
Heterologous DNA sequences from rearrangements with the genomes of host cells, genomic fragments from hybrid cells, or impure tissue sources can threaten the purity of libraries derived from RNA or DNA. Hybridization methods can detect contaminants only from known or suspected heterologous sources, and whole-library screening is technically very difficult. Detection of contaminating heterologous clones by sequence alignment is possible only when related sequences are present in a known database. We have developed a statistical test to identify heterologous sequences that is based on differences in the hexamer composition of DNA from different organisms. This test does not require that sequences similar to potential heterologous contaminants be present in the database, and can in principle detect contamination by previously unknown organisms. We have applied this test to the major public expressed sequence tag (EST) data sets to evaluate its utility as a quality control measure and a peer evaluation tool. There is detectable heterogeneity in most human and C. elegans EST data sets, but it is apparently not associated with cross-species contamination. However, there is direct evidence for both yeast and bacterial sequence contamination in some public database sequences annotated as human. Results obtained with the hexamer test have been confirmed with similarity searches using sequences from the relevant data sets.
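The abstract does not give the test statistic, so the following is only a minimal sketch of the general idea of comparing hexamer composition, not the authors' method. It assumes a simple log-likelihood ratio between two hexamer frequency models with add-one smoothing; the function names (`hexamer_counts`, `hexamer_model`, `log_likelihood_ratio`) and the `pseudocount` parameter are hypothetical and chosen for illustration.

```python
from collections import Counter
from itertools import product
from math import log

K = 6  # hexamers
ALL_HEXAMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # 4**6 = 4096 words


def hexamer_counts(seq):
    """Count overlapping hexamers, skipping any window with a non-ACGT base."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - K + 1):
        word = seq[i:i + K]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts


def hexamer_model(seqs, pseudocount=1.0):
    """Estimate smoothed hexamer frequencies from a set of reference sequences.

    Assumption: additive smoothing, so that unseen hexamers keep a
    nonzero probability and the ratio below is always defined.
    """
    counts = Counter()
    for s in seqs:
        counts.update(hexamer_counts(s))
    total = sum(counts.values()) + pseudocount * len(ALL_HEXAMERS)
    return {w: (counts[w] + pseudocount) / total for w in ALL_HEXAMERS}


def log_likelihood_ratio(seq, host_model, other_model):
    """Sum of log(other/host) over the hexamers of seq.

    Positive values mean the sequence's hexamer usage looks more like
    other_model than host_model under this illustrative scoring.
    """
    counts = hexamer_counts(seq)
    return sum(n * log(other_model[w] / host_model[w]) for w, n in counts.items())


if __name__ == "__main__":
    # Toy reference sets, invented for illustration only.
    host = hexamer_model(["ATATATTTAAATATATTAAATTTATATA" * 3])
    other = hexamer_model(["GCGCGGCCGCGGGCCCGCGCGGCC" * 3])
    for query in ("GCGCGGCCGCGG", "ATATATTTAAAT"):
        print(query, round(log_likelihood_ratio(query, host, other), 3))
```

Note that this sketch scores a sequence against a named alternative model, which is a simplification: the abstract states that the published test can in principle flag contamination from previously unknown organisms, and that would call for a test of fit against the expected composition alone.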