Institute of Physical Science and Information Technology, School of Computer Science and Technology, Anhui University, Hefei, Anhui, China.
School of Life Sciences, Anhui University, Hefei, Anhui, China.
Brief Bioinform. 2019 Sep 27;20(5):1925-1933. doi: 10.1093/bib/bby059.
While recent advances in next-generation sequencing technologies have enabled the creation of a multitude of databases in cancer genomic research, there is not yet a comprehensive database focusing on the annotation of driver indels (insertions and deletions). Therefore, we have developed the database of Cancer driver InDels (dbCID), a collection of known coding indels that are likely to be involved in cancer development, progression or therapy. dbCID contains experimentally supported and putative driver indels derived from manual curation of the literature and is freely available online at http://bioinfo.ahu.edu.cn:8080/dbCID. Using the data deposited in dbCID, we summarized the features of driver indels at four levels (gene, DNA, transcript and protein) by comparing them with putative neutral indels. We found that most of the genes containing driver indels in dbCID are known cancer genes that play a role in tumorigenesis. Contrary to expectation, the sequences affected by driver frameshift indels are not larger than those affected by neutral ones. In addition, both frameshift and inframe driver indels tend to disrupt highly conserved regions, in DNA sequences as well as in protein domains. Finally, we developed a computational method for discriminating cancer driver frameshift indels from neutral ones based on the data deposited in dbCID. The proposed method outperformed other widely used non-cancer-specific predictors on an external test set, which demonstrates the usefulness of the data deposited in dbCID. We hope dbCID will serve as a benchmark for improving and evaluating prediction algorithms, and that the characteristics summarized here may assist in investigating the mechanism of indel-cancer association.
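The abstract distinguishes frameshift from inframe coding indels. As a minimal illustration of that distinction only (not the dbCID pipeline or the authors' predictor), the sketch below classifies an indel by whether its net length change is a multiple of three. The function name and the VCF-style REF/ALT input are assumptions made for this example, and the rule ignores special cases such as indels that span splice sites or start/stop codons.

```python
def classify_coding_indel(ref: str, alt: str) -> str:
    """Label a coding indel 'frameshift' or 'inframe'.

    Hypothetical helper for illustration: `ref` and `alt` are VCF-style
    alleles, e.g. ref="A", alt="AT" for a one-base insertion.
    """
    # Net number of bases gained (positive) or lost (negative).
    delta = len(alt) - len(ref)
    if delta == 0:
        raise ValueError("REF and ALT have equal length; not a simple indel")
    # A length change divisible by 3 preserves the downstream reading frame.
    return "inframe" if delta % 3 == 0 else "frameshift"


if __name__ == "__main__":
    print(classify_coding_indel("A", "AT"))    # one-base insertion
    print(classify_coding_indel("ACGT", "A"))  # three-base deletion
```

A real annotation workflow would normally rely on a transcript-aware variant annotator rather than allele lengths alone; the sketch is meant only to make the two indel classes concrete.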