Korning P G, Hebsgaard S M, Rouze P, Brunak S
Center for Biological Sequence Analysis, Technical University of Denmark, Lyngby, Denmark.
Nucleic Acids Res. 1996 Jan 15;24(2):316-20. doi: 10.1093/nar/24.2.316.
Data-driven computational biology relies on the large quantities of genomic data stored in international sequence data banks. However, its potential is drastically impaired if the stored data are unreliable. During a project aiming to predict splice sites in the dicot Arabidopsis thaliana, we extracted a data set from the A. thaliana entries in GenBank. A number of simple 'sanity' checks, based on the nature of the data, revealed an alarmingly high error rate. More than 15% of the most important entries extracted contained erroneous information. In addition, a number of entries had directly conflicting assignments of exons and introns that did not stem from alternative splicing. In a few cases the errors are mere typographical misprints, which may be corrected by comparison with the original papers, but the most common errors are wrong assignments of splice sites from experimental data. It is proposed that the level of error correction should be increased and that gene structure sanity checks should be incorporated, including at the submitter level, to avoid or reduce the problem in the future. A non-redundant and error-corrected subset of the data for A. thaliana is made available through anonymous FTP.
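The abstract does not list the specific sanity checks that were applied. As a rough illustration only, the sketch below shows what a gene structure check of this kind might look like, assuming annotated exons are given as 1-based inclusive coordinates on the forward strand and that introns are expected to follow the canonical GT...AG rule. The function name `check_gene_structure` and the toy sequence are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# 1-based, inclusive (start, end) coordinates, in the style of GenBank locations.
Interval = Tuple[int, int]


def check_gene_structure(seq: str, exons: List[Interval]) -> List[str]:
    """Return a list of problems found; an empty list means no check failed.

    Assumed checks (illustrative, not the authors' actual procedure):
      * every exon lies inside the sequence,
      * consecutive exons do not overlap or touch,
      * every implied intron starts with GT and ends with AG.
    """
    problems = []
    seq = seq.upper()
    exons = sorted(exons)

    for start, end in exons:
        if not (1 <= start <= end <= len(seq)):
            problems.append(f"exon {start}..{end} lies outside the sequence")

    for (s1, e1), (s2, e2) in zip(exons, exons[1:]):
        if s2 <= e1:
            problems.append(f"exons {s1}..{e1} and {s2}..{e2} overlap")
            continue
        if s2 == e1 + 1:
            problems.append(f"exons {s1}..{e1} and {s2}..{e2} have no intron between them")
            continue
        # Intron occupies 1-based positions e1+1 .. s2-1, i.e. the slice [e1 : s2-1].
        intron = seq[e1:s2 - 1]
        if len(intron) < 4 or not (intron.startswith("GT") and intron.endswith("AG")):
            problems.append(f"intron {e1 + 1}..{s2 - 1} lacks GT...AG boundaries")

    return problems


# Toy example: exon 1..6, an 11-base GT...AG intron at 7..17, exon 18..23.
toy_seq = "ATGGCC" + "GTAAGTTTTAG" + "GGATAA"
good = check_gene_structure(toy_seq, [(1, 6), (18, 23)])
shifted = check_gene_structure(toy_seq, [(1, 7), (18, 23)])  # splice site off by one
overlapping = check_gene_structure(toy_seq, [(1, 6), (5, 23)])
```

A failed check of this kind is only a flag for manual review: a small fraction of genuine introns use non-canonical boundaries, so a violation does not by itself prove that an entry is wrong.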