Department of Biology, Faculty of Science, Ferdowsi University of Mashhad, Mashhad, Iran.
Novel Diagnostics and Therapeutics Research Group, Institute of Biotechnology, Ferdowsi University of Mashhad, Mashhad, Iran.
Mol Genet Genomics. 2021 May;296(3):677-688. doi: 10.1007/s00438-021-01768-z. Epub 2021 Mar 18.
Contamination in sequencing data, especially in reference genomes, leads to inevitable errors in downstream analyses. Similarly, the presence of contaminants in transcriptomes misrepresents the molecular basis of various interactions. In this study, we report a large number of plant transcriptomes contaminated with RNAs encoding POU domain proteins, a family of proteins that has not been reported in plants or fungi. In addition, our findings show that the reference genome of Rhodamnia argentea contains four POU domain protein-coding sequences. The foreign fragments turned out to be related to arthropods that are considered plant pests. We also identified two contaminated draft genomes, Humulus lupulus and Cannabis sativa, that contain complete rDNA sequences originating from Tetranychus species. We therefore recommend carefully screening sequencing data before releasing them in public databases, and checking existing genomes for possible contamination.