Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, NO-0316 Oslo, Norway.
Faculty of Biology, Johannes Gutenberg University Mainz, Hans-Dieter-Husch-Weg 15, 55128 Mainz, Germany.
Nucleic Acids Res. 2019 Dec 2;47(21):10994-11006. doi: 10.1093/nar/gkz841.
The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with 'ready-to-use' deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others.