Department of Computer Science, Indiana University, Bloomington, IN 47405, USA.
Department of Biostatistics, University of Florida, Gainesville, FL 32603, USA.
Bioinformatics. 2023 Feb 3;39(2). doi: 10.1093/bioinformatics/btad060.
Small insertions and deletions (sindels) in the human genome have important implications for human disease. One important mechanism by which non-coding sindels (nc-sindels) affect human diseases and phenotypes is the regulation of gene expression. Nevertheless, current sequencing experiments may lack the statistical power and resolution to pinpoint functional sindels because of lower minor allele frequency or small effect size. As an alternative strategy, a supervised machine learning method can identify otherwise masked functional sindels by predicting their regulatory potential directly. However, computational methods for annotating and predicting regulatory sindels, especially in non-coding regions, are underdeveloped.
By leveraging labeled nc-sindels identified by cis-expression quantitative trait loci (cis-eQTL) analyses across 44 tissues in the Genotype-Tissue Expression (GTEx) project, together with a compilation of generic functional annotations and large-scale epigenomic profiles, we develop TIssue-specific Variant Annotation for Non-coding indel (TIVAN-indel), a supervised computational framework for predicting non-coding regulatory sindels. We demonstrate that TIVAN-indel achieves the best prediction performance in both within-tissue and cross-tissue prediction. As an independent evaluation, we train TIVAN-indel on the 'Whole Blood' tissue in GTEx and test the model on 15 immune cell types from an independent study, the Database of Immune Cell Expression. Lastly, we perform an enrichment analysis of both true and predicted sindels in key regulatory regions such as chromatin interactions, open chromatin regions and histone modification sites, and find biologically meaningful enrichment patterns.
https://github.com/lichen-lab/TIVAN-indel.
Supplementary data are available at Bioinformatics online.
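To make the enrichment analysis mentioned in the abstract concrete, the following is a minimal sketch of one common way to test whether a set of sindels is over-represented in a regulatory region. The abstract does not say which statistical test TIVAN-indel uses, so the one-sided Fisher's exact test here is an assumption, and the function name `fisher_enrichment` and all counts are hypothetical illustrations rather than values from the study.

```python
from math import comb


def fisher_enrichment(a, b, c, d):
    """One-sided Fisher's exact test for enrichment in a 2x2 table.

    Hypothetical layout (an assumption, not taken from the paper):
        a: regulatory sindels inside the region (e.g. open chromatin)
        b: regulatory sindels outside the region
        c: background sindels inside the region
        d: background sindels outside the region

    Returns (odds_ratio, p_value), where p_value is the probability of
    seeing at least `a` regulatory sindels inside the region by chance.
    """
    n = a + b + c + d          # all sindels
    row1 = a + b               # all regulatory sindels
    col1 = a + c               # all sindels inside the region
    total = comb(n, col1)      # ways to choose which sindels fall inside

    # Upper tail of the hypergeometric distribution: P(X >= a).
    upper = min(row1, col1)
    p_value = sum(
        comb(row1, x) * comb(n - row1, col1 - x)
        for x in range(a, upper + 1)
    ) / total

    # Sample odds ratio; reported as infinity when b * c is zero.
    odds_ratio = float("inf") if b * c == 0 else (a * d) / (b * c)
    return odds_ratio, p_value


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    odds, p = fisher_enrichment(a=40, b=60, c=150, d=750)
    print(f"odds ratio = {odds:.2f}, one-sided p = {p:.3g}")
```

An odds ratio above 1 with a small p-value would indicate that the sindels of interest fall in the region more often than the background set does. In practice the same comparison would be repeated for each region type and for both true and predicted sindels.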