Panchin Alexander Y, Spirin Sergey A, Lukyanov Sergey A, Lebedev Yuri B, Panchin Yuri V
Shemyakin and Ovchinnikov Institute of Bioorganic Chemistry, Moscow, Russia.
J Bioinform Comput Biol. 2008 Aug;6(4):759-73. doi: 10.1142/s0219720008003709.
Expressed sequence tags (ESTs) are sequences of 500-1000 bp that correspond to mRNAs derived from different sources (cell lines, tissues, etc.). The human EST database contains over 8,000,000 sequences, with over 4,000,000,000 total nucleotides. RNA molecules are transcribed from a genomic DNA template; therefore, all ESTs should match the corresponding genome. Nevertheless, we found approximately 11,000 ESTs in the human EST database that do not match any sequence in the human genome database. The presence of these "trash" ESTs (TESTs) in the EST database could result from DNA or RNA contamination of laboratory equipment, tissues, or cell lines. TESTs could also represent sequences from unidentified human genes or from species inhabiting the human body. Here, we attempt to identify the sources of contamination in the human EST database. In particular, we discuss systematic contamination of mammalian EST databases with plant sequences.
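The screening idea described above (flagging ESTs that have no counterpart in the genome) can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes the ESTs have already been aligned to the genome with a BLAST-like tool that writes the standard 12-column tabular format (query ID, subject ID, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score). The function name `find_unmatched_ests` and the thresholds `min_identity` and `min_coverage` are hypothetical choices made for illustration only.

```python
def find_unmatched_ests(est_lengths, hit_lines, min_identity=95.0, min_coverage=0.5):
    """Return the sorted IDs of ESTs that have no acceptable genome hit.

    est_lengths: dict mapping EST ID -> EST length in nucleotides.
    hit_lines:   iterable of tab-separated lines in 12-column BLAST
                 tabular format (one alignment per line).
    The thresholds are illustrative assumptions, not values from the paper.
    """
    matched = set()
    for line in hit_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id = fields[0]
        if query_id not in est_lengths:
            continue  # hit for a sequence we were not asked about
        identity = float(fields[2])
        q_start, q_end = int(fields[6]), int(fields[7])
        # Fraction of the EST covered by this single alignment.
        coverage = (abs(q_end - q_start) + 1) / est_lengths[query_id]
        if identity >= min_identity and coverage >= min_coverage:
            matched.add(query_id)
    # Everything without an acceptable hit is a candidate "trash" EST.
    return sorted(set(est_lengths) - matched)


if __name__ == "__main__":
    # Toy data: EST1 aligns well, EST2 aligns poorly, EST3 has no hit at all.
    lengths = {"EST1": 600, "EST2": 500, "EST3": 700}
    hits = [
        "EST1\tchr1\t99.5\t590\t3\t0\t1\t590\t1000\t1589\t0.0\t1050",
        "EST2\tchr2\t80.0\t100\t20\t0\t1\t100\t5\t104\t1e-5\t60",
    ]
    print(find_unmatched_ests(lengths, hits))
```

In this toy input, EST2 and EST3 are reported as unmatched. A real screen would also need to combine multiple partial alignments of a spliced EST (exons separated by introns) before judging coverage, which this sketch deliberately omits.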