Chen Qingyu, Zobel Justin, Verspoor Karin
Department of Computing and Information Systems, The University of Melbourne, Parkville, VIC, 3010, Australia.
Database (Oxford). 2017 Jan 10;2017. doi: 10.1093/database/baw163. Print 2017.
GenBank, the EMBL European Nucleotide Archive and the DNA DataBank of Japan, known collectively as the International Nucleotide Sequence Database Collaboration or INSDC, are the three most significant nucleotide sequence databases. Their records are derived from laboratory work undertaken by different individuals, by different teams, with a range of technologies and assumptions and over a period of decades. As a consequence, they contain a great many duplicates, redundancies and inconsistencies, but neither the prevalence nor the characteristics of various types of duplicates have been rigorously assessed. Existing duplicate detection methods in bioinformatics only address specific duplicate types, with inconsistent assumptions; and the impact of duplicates in bioinformatics databases has not been carefully assessed, making it difficult to judge the value of such methods. Our goal is to assess the scale, kinds and impact of duplicates in bioinformatics databases, through a retrospective analysis of merged groups in INSDC databases. Our outcomes are threefold: (1) We analyse a benchmark dataset consisting of duplicates manually identified in INSDC (67 888 merged groups with 111 823 duplicate pairs across 21 organisms from INSDC databases) in terms of the prevalence, types and impacts of duplicates. (2) We categorize duplicates at both sequence and annotation level, with supporting quantitative statistics, showing that different organisms have different prevalence of distinct kinds of duplicate. (3) We show that the presence of duplicates has practical impact via a simple case study on duplicates, in terms of GC content and melting temperature. We demonstrate that duplicates not only introduce redundancy, but can lead to inconsistent results for certain tasks.
Our findings lead to a better understanding of the problem of duplication in biological databases. Database URL: the merged records are available at https://cloudstor.aarnet.edu.au/plus/index.php/s/Xef2fvsebBEAv9w.
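To make the GC content and melting temperature case study concrete, the following is a minimal sketch of how two records in a duplicate pair might be compared on those two measures. The abstract does not say which melting temperature formula was used; the Wallace rule (Tm = 2(A+T) + 4(G+C), a rough estimate intended for short oligonucleotides) is an assumption made here for illustration, and the function names and example sequences are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq: str) -> float:
    """Melting temperature in degrees Celsius by the Wallace rule.

    Assumption: the source does not specify a formula; this rule is
    only a rough estimate and is meant for short oligonucleotides.
    """
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2.0 * at + 4.0 * gc


def compare_pair(seq_a: str, seq_b: str) -> dict:
    """Absolute differences in GC content and Tm between two sequences."""
    return {
        "gc_diff": abs(gc_content(seq_a) - gc_content(seq_b)),
        "tm_diff": abs(wallace_tm(seq_a) - wallace_tm(seq_b)),
    }


# Hypothetical duplicate pair: same length, different base composition.
result = compare_pair("ATGCGC", "ATGCAT")
print(result)
```

If two records treated as duplicates give a non-zero difference on either measure, an analysis that picks one of them arbitrarily would report different values depending on the choice, which is the kind of inconsistency the abstract refers to.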