Bioinformatics Institute, Agency for Science, Technology, and Research (A*Star), 30 Biopolis Street, #07-01 Matrix, Singapore, 138671, Republic of Singapore.
School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi, India.
BMC Bioinformatics. 2020 Oct 19;21(1):466. doi: 10.1186/s12859-020-03794-x.
Homology-based methods are among the most important and widely used approaches for functional annotation of high-throughput microbial genome data. A major limitation of these methods is the absence of well-characterized sequences for certain functions. Non-homology methods, which rely on the context and interactions of a protein, are useful for identifying missing metabolic activities and for functional annotation in the absence of significant sequence similarity. In the current work, we employ homology-based and context-based methods incrementally to identify local holes and chokepoints that have not been annotated in the Mycobacterium tuberculosis genome but whose presence is indicated by their interactions with known proteins in a metabolic network context. We have developed two computational procedures using network theory: one to identify orphan enzymes (the 'Hole finding protocol') and one to identify candidate proteins for each predicted orphan enzyme (the 'Hole filling protocol'). Using M. tuberculosis as a case study, we propose an integrated interaction score, based on scores from the STRING database, to identify the candidate protein sequences most likely to perform the missing function of each orphan enzyme.
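The abstract does not give the formula for the integrated interaction score, so the following is only a minimal sketch of the general idea of hole filling, under stated assumptions: each unannotated protein is scored by summing its STRING combined scores (integers on STRING's 0 to 1000 scale) with the annotated proteins that neighbour the hole in the metabolic network. The function name `rank_candidates`, the simple summation, and all protein identifiers are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def rank_candidates(neighbor_proteins, string_scores, annotated):
    """Rank unannotated proteins by their summed STRING score to a hole's neighbours.

    neighbor_proteins: set of annotated proteins adjacent to the hole (assumed input).
    string_scores: dict mapping an unordered protein pair (a, b) to an integer score.
    annotated: set of proteins that already have a function and are not candidates.
    Summation is an illustrative choice, not the paper's integrated score.
    """
    totals = defaultdict(int)
    for (a, b), score in string_scores.items():
        # Each pair is undirected, so check both orientations.
        for neighbor, candidate in ((a, b), (b, a)):
            if neighbor in neighbor_proteins and candidate not in annotated:
                totals[candidate] += score
    # Highest total first; ties broken by identifier for a stable order.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy data: P1 and P2 flank the hole, C1..C3 are unannotated.
example_scores = {
    ("P1", "C1"): 900,
    ("P2", "C1"): 700,
    ("P1", "C2"): 400,
    ("C2", "P2"): 300,
    ("P3", "C3"): 950,  # P3 is not a neighbour of the hole, so C3 is ignored
    ("P1", "P2"): 800,  # both annotated, so neither is a candidate
}
ranked = rank_candidates({"P1", "P2"}, example_scores, {"P1", "P2", "P3"})
```

In this toy example, `C1` ranks above `C2` because it is linked to both neighbours of the hole with higher scores, while `C3` is never scored because its only link is to a protein outside the hole's neighbourhood.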
Applying ModEnzA, an automated homology-based enzyme identification protocol, to the M. tuberculosis genome yielded 56 novel enzyme predictions. Using the 'Hole finding protocol', we further predicted 74 putative local holes, 6 chokepoints, and 3 high-confidence local holes in the genome. The 'Hole filling protocol' was validated on the E. coli genome using artificial in silico enzyme knockouts, where our method was 25% more accurate than other methods at placing the correct sequence for the knocked-out enzyme among the top 10 ranks. The method was further validated on 8 additional genomes.
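The knockout validation described above can be read as a top-k accuracy measurement: for each artificially knocked-out enzyme, check whether the gene that truly encodes it appears among the first k ranked candidates. The sketch below illustrates that metric only. The function name `top_k_accuracy`, the data layout, and the enzyme and gene identifiers are hypothetical, and the paper's exact evaluation procedure may differ.

```python
def top_k_accuracy(rankings, truth, k=10):
    """Fraction of knocked-out enzymes whose true gene is within the top k candidates.

    rankings: dict mapping an enzyme identifier to its ranked list of candidate genes.
    truth: dict mapping the same enzyme identifiers to the gene that was knocked out.
    """
    if not truth:
        return 0.0
    hits = sum(
        1
        for enzyme, gene in truth.items()
        if gene in rankings.get(enzyme, [])[:k]
    )
    return hits / len(truth)


# Hypothetical toy data for three in silico knockouts.
example_rankings = {
    "EC_A": ["g1", "g2"],
    "EC_B": ["g5", "g3"],
    "EC_C": [],  # no candidate could be proposed
}
example_truth = {"EC_A": "g2", "EC_B": "g4", "EC_C": "g9"}
accuracy_at_10 = top_k_accuracy(example_rankings, example_truth, k=10)
```

Here only the `EC_A` knockout is recovered, so one of three knockouts counts as a hit. Comparing this value between two ranking methods at the same k is one way to express a relative gain in accuracy.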
We have developed methods that can be generalized to augment homology-based annotation by identifying missing enzyme-coding genes and predicting candidate proteins for them. For pathogens such as M. tuberculosis, this work is significant because it expands the known protein repertoire and thereby the potential for identifying novel drug targets.