ProSol-multi: Protein solubility prediction via amino acids multi-level correlation and discriminative distribution.

Authors

Ghafoor Hina, Asim Muhammad Nabeel, Ibrahim Muhammad Ali, Dengel Andreas

Affiliations

Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany.

German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany.

Publication information

Heliyon. 2024 Aug 22;10(17):e36041. doi: 10.1016/j.heliyon.2024.e36041. eCollection 2024 Sep 15.

Abstract

Protein solubility prediction supports the careful selection of highly effective candidate proteins for drug development. In recombinant protein synthesis, it is valuable for optimizing key protein characteristics, including stability, functionality, and ease of purification. Solubility also carries information about potential biomarkers and therapeutic targets and helps in the early forecasting of neurodegenerative diseases, cancer, and cardiovascular disorders. Traditional wet-lab approaches to determining protein solubility are error-prone, time-consuming, and costly, so researchers have turned to Artificial Intelligence to replace these experimental approaches with computational predictors. These predictors infer protein solubility by analyzing amino acid distributions in raw protein sequences. There is still considerable room for more robust computational predictors, because existing ones fail to extract a comprehensive discriminative distribution of amino acids. To discriminate soluble proteins from insoluble proteins more precisely, this paper presents the ProSol-Multi predictor, which combines a novel MLCDE encoder with a Random Forest classifier. The MLCDE encoder transforms protein sequences into informative statistical vectors by capturing the multi-level correlation and discriminative distribution of amino acids within raw protein sequences. The proposed encoder is evaluated against 56 existing protein sequence encoding methods on a widely used protein solubility prediction benchmark dataset under two experimental settings, intrinsic and extrinsic. Intrinsic evaluation reveals that, among all sequence encoders, the proposed MLCDE encoder manages to generate non-overlapping clusters of the soluble and insoluble classes. In extrinsic evaluation, 10 machine learning classifiers achieve better performance with the proposed MLCDE encoder than with the 56 existing protein sequence encoders. Moreover, across 4 public benchmark datasets, the proposed ProSol-Multi predictor outperforms 20 existing predictors by an average of 3% in accuracy and 2% in MCC and AU-ROC. The ProSol-Multi interactive web application is available at https://sds_genetic_analysis.opendfki.de/ProSol-Multi.
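To make two ideas from the abstract concrete, the sketch below shows (a) one very simple way to turn a raw protein sequence into a statistical vector and (b) how the Matthews correlation coefficient (MCC) is computed for a binary soluble/insoluble task. This is not the MLCDE encoder, whose details the abstract does not give; plain amino acid composition is swapped in purely for illustration. The function names and the label convention (1 = soluble, 0 = insoluble) are assumptions.

```python
from collections import Counter
from math import sqrt

# The 20 standard amino acids, in a fixed order that defines the vector layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(sequence):
    """Relative frequency of each standard amino acid in the sequence.

    Illustrative stand-in only: this is plain amino acid composition,
    NOT the MLCDE encoding described in the paper.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels.

    Assumed label convention: 1 = soluble, 0 = insoluble.
    Returns 0.0 when the denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    vec = composition_vector("MKTAYIAKQR")  # made-up example sequence
    print(len(vec), round(sum(vec), 6))
    print(mcc([1, 1, 0, 0], [1, 0, 0, 0]))
```

In a pipeline of the kind the abstract describes, vectors like these would be passed to a classifier such as a Random Forest, and MCC would be reported alongside accuracy and AU-ROC.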

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0d2d/11401092/decfb050f735/gr001.jpg
