Atherosclerosis and Vascular Medicine Section, Department of Medicine, Methodist DeBakey Heart Center, Baylor College of Medicine, 6565 Fannin Street, Houston, TX 77030, USA.
DNA Cell Biol. 2012 Feb;31(2):151-63. doi: 10.1089/dna.2011.1339. Epub 2011 Sep 6.
Palindromati, the massive host-edited synthetic palindromic contamination found in GenBank, is illustrated and exemplified. Millions of contaminated sequences containing portions, or tandem repeats of portions, derived from the ZAP adaptor or related linkers are demonstrated by (1) the 12-bp sequence reported elsewhere as exon Xb, 5' CCCGAATTCGGG 3'; (2) a related 22-bp sequence, 5' CTCGTGCCGAATTCGGCACGAG 3'; and (3) a longer, related 44-bp sequence, 5' CTCGTGCCGAATTCGGCACGAGCTCGTGCCGAATTCGGCACGAG 3'. Possible reasons why these long contaminating sequences persist in the databases are presented: (1) the recognition site on the plus strand (+) is single-strand self-annealed; and (2) the recognition site on the minus strand (-) is not only single-strand self-annealed but also located far from the self-annealed plus strand, which makes it impossible for the active EcoRI enzyme dimer to form and cut its target sequence, 5' G/AATTC 3'. As a possible solution, it is suggested to rely on at least two or three independent results, such as sequences obtained by independent laboratories, preferably using independent sequencing methodologies. This information may help in developing bioinformatics tools capable of detecting and removing these contaminants, and in inferring why some damaged sequences that cause genetic diseases escape detection by the molecular quality-control mechanisms of cells and organisms and are passed unchecked through the generations.
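As an illustration of the kind of detection tool the abstract calls for, the following minimal Python sketch scans a nucleotide sequence for the three adaptor-derived motifs quoted above and checks that each motif equals its own reverse complement. It is not the authors' method: the names `MOTIFS`, `reverse_complement`, `is_palindromic`, and `find_motifs` are hypothetical, and the sketch assumes exact matching on unambiguous A/C/G/T input, so partial or mutated copies of the linkers would be missed.

```python
# Adaptor-derived motifs exactly as quoted in the abstract.
MOTIFS = {
    "12-bp": "CCCGAATTCGGG",
    "22-bp": "CTCGTGCCGAATTCGGCACGAG",
    "44-bp": "CTCGTGCCGAATTCGGCACGAG" * 2,  # tandem of the 22-bp motif
}

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_palindromic(seq):
    """True if the sequence reads the same as its reverse complement."""
    s = seq.upper()
    return s == reverse_complement(s)


def find_motifs(seq):
    """Return (motif name, 0-based start) for every exact motif occurrence.

    Overlapping occurrences are reported. Hypothetical helper: exact
    matching only, with no tolerance for mismatches or truncated copies.
    """
    s = seq.upper()
    hits = []
    for name, motif in MOTIFS.items():
        start = s.find(motif)
        while start != -1:
            hits.append((name, start))
            start = s.find(motif, start + 1)
    return hits
```

A call such as `find_motifs(record_sequence)` on a database entry would flag candidate contaminated regions for manual review. Because all three motifs are reverse-complement palindromes, scanning a single strand is enough to cover both orientations.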