Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India.
Proteomics. 2024 Mar;24(6):e2300231. doi: 10.1002/pmic.202300231. Epub 2023 Jul 31.
Non-invasive diagnostics and therapies spare patients from painful procedures, and exosomal proteins can serve as important biomarkers for developing them. In this study, we attempted to build a model to predict exosomal proteins. All models were trained, tested, and evaluated on a non-redundant dataset of 2831 exosomal and 2831 non-exosomal proteins in which no two proteins share more than 40% sequence similarity. We first applied the standard similarity-based method, Basic Local Alignment Search Tool (BLAST), which failed to predict exosomal proteins because of the low level of similarity within the dataset. To overcome this limitation, we developed machine learning (ML) models using compositional and evolutionary features of proteins, achieving an area under the receiver operating characteristic curve (AUROC) of 0.73. Our analysis also indicated that exosomal proteins contain a variety of sequence-based motifs that can be used for prediction. We therefore developed a hybrid method that combines the motif-based and ML-based approaches, achieving a maximum AUROC of 0.85 and a Matthews correlation coefficient (MCC) of 0.56 on an independent dataset. On that independent dataset, the hybrid model outperforms currently available methods. A web server and standalone software, ExoProPred (https://webs.iiitd.edu.in/raghava/exopropred/), have been created to help scientists predict and discover exosomal proteins and find the functional motifs present in them.
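The abstract does not say how the motif-based and ML-based predictions are combined, so the following Python sketch is only an illustration under stated assumptions, not the ExoProPred implementation. It assumes that motifs are plain substrings, that a motif hit adds a fixed bonus to the ML score, and that the bonus is 0.5. The function names, the example motif, and the example scores are all hypothetical. It also shows one compositional feature (amino acid composition) and the standard MCC formula used as an evaluation metric.

```python
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the fraction of each standard residue in the sequence."""
    sequence = sequence.upper()
    length = len(sequence)
    if length == 0:
        raise ValueError("sequence must not be empty")
    return {aa: sequence.count(aa) / length for aa in AMINO_ACIDS}


def hybrid_score(ml_score, sequence, motifs, motif_weight=0.5):
    """Combine an ML score with a motif search.

    Assumption: a fixed bonus (motif_weight) is added when any motif
    occurs as a substring; the source does not specify the actual rule.
    """
    sequence = sequence.upper()
    has_motif = any(motif.upper() in sequence for motif in motifs)
    return ml_score + (motif_weight if has_motif else 0.0)


def matthews_corrcoef(tp, tn, fp, fn):
    """Standard MCC from confusion-matrix counts; 0.0 if undefined."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    toy_sequence = "MKTAYIAKQR"   # made-up sequence
    toy_motifs = ["TAY"]          # made-up motif, not from the study
    composition = amino_acid_composition(toy_sequence)
    print(round(composition["K"], 2))
    print(hybrid_score(0.4, toy_sequence, toy_motifs))
    print(matthews_corrcoef(tp=40, tn=38, fp=12, fn=10))
```

In this sketch a protein would be called exosomal when its hybrid score exceeds a chosen threshold; that threshold, like the bonus weight, would need to be tuned on training data.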