A hybrid approach for predicting transcription factors.

Author information

Patiyal Sumeet, Tiwari Palak, Ghai Mohit, Dhapola Aman, Dhall Anjali, Raghava Gajendra P S

Affiliation

Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India.

Publication information

Front Bioinform. 2024 Jul 25;4:1425419. doi: 10.3389/fbinf.2024.1425419. eCollection 2024.

Abstract

Transcription factors are essential DNA-binding proteins that regulate the transcription rate of several genes and control the expression of genes inside a cell. The prediction of transcription factors with high precision is important for understanding biological processes such as cell differentiation, intracellular signaling, and cell-cycle control. In this study, we developed a hybrid method that combines alignment-based and alignment-free methods for predicting transcription factors with higher accuracy. All models have been trained, tested, and evaluated on a large dataset that contains 19,406 transcription factors and 523,560 non-transcription factor protein sequences. To avoid biases in evaluation, the datasets were divided into training and validation/independent datasets, where 80% of the data was used for training, and the remaining 20% was used for external validation. In the case of alignment-free methods, models were developed using machine learning techniques and the composition-based features of a protein. Our best alignment-free model obtained an AUC of 0.97 on an independent dataset. In the case of the alignment-based method, we used BLAST at different cut-offs to predict the transcription factors. Although the alignment-based method demonstrated excellent performance, it was unable to cover all transcription factors due to instances of no hits. To combine the strengths of both methods, we developed a hybrid method that combines alignment-free and alignment-based methods. In the hybrid method, we added the scores of the alignment-free and alignment-based methods and achieved a maximum AUC of 0.99 on the independent dataset. The method proposed in this study performs better than existing methods. We incorporated the best models in the webserver/Python Package Index/standalone package of "TransFacPred" (https://webs.iiitd.edu.in/raghava/transfacpred).
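The score-addition step described in the abstract can be illustrated with a minimal sketch. It is not the TransFacPred implementation: the abstract does not state which composition features, BLAST cut-offs, or score weights were used, so the function names, the plain amino acid composition feature, the ±0.5 BLAST weight, and the 0.5 decision threshold below are all assumptions chosen for illustration. The sketch shows how a no-hit case falls back to the alignment-free score alone.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Percentage of each standard residue: one simple composition-based feature.

    Assumption: the paper's exact feature set is not given in the abstract.
    Non-standard residues count toward the length but get no feature slot.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return [100.0 * counts[aa] / len(seq) for aa in AMINO_ACIDS]


def blast_score(hit_label, weight=0.5):
    """Map a BLAST outcome to an additive score (hypothetical weighting).

    hit_label is "TF" if the top hit is a known transcription factor,
    "non-TF" if it is a known non-transcription factor, and None when
    BLAST returns no hit at the chosen cut-off.
    """
    if hit_label == "TF":
        return weight
    if hit_label == "non-TF":
        return -weight
    return 0.0  # no hit: the alignment-based method contributes nothing


def hybrid_score(ml_probability, hit_label, weight=0.5):
    """Add the alignment-free (ML) score and the alignment-based score."""
    return ml_probability + blast_score(hit_label, weight)


def predict_tf(ml_probability, hit_label, threshold=0.5):
    """Call a protein a transcription factor if the hybrid score reaches
    the threshold (the threshold value is an assumption)."""
    return hybrid_score(ml_probability, hit_label) >= threshold
```

In this sketch, a protein with a borderline model probability of 0.4 is called a transcription factor when BLAST finds a transcription-factor hit, while a protein with no hit is decided by the model probability alone, which is how the hybrid scheme covers sequences the alignment-based method misses.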
